Abstract. We introduce a new model of proteins, which extends and enhances the
traditional graphical representation by associating a combinatorial object called a
fatgraph to any protein based upon its intrinsic geometry. Fatgraphs can easily be
stored and manipulated as triples of permutations, and these methods are therefore
amenable to fast computer implementation. Applications include the refinement of
structural protein classifications and the prediction of geometric and other properties
of proteins from their chemical structures.



Introduction 

A "fatgraph" G is a graph in the usual sense of the term together with cyclic orderings 
on the half-edges about each vertex (cf. Section 2.2 for the precise definition). Fatgraphs
arose in mathematics [25] as the combinatorial objects indexing orbi-cells in a certain
decomposition of Riemann's moduli space [25, 28] and in physics [4] [29] as index sets for
the large N limit of certain matrix models. A basic geometric point is that a fatgraph G
uniquely determines a corresponding surface F(G) with boundary which contains G as a
deformation retract. Fatgraphs have already proved useful in geometry [13, 15] [21] [25], in
theoretical physics [7] [17], and in modeling RNA secondary structures [26], for example.

A "protein" P is a linear polymer of amino acids (cf. Section 1 for more precision), and
the study of proteins is a central theme in contemporary biophysics [1] [5]. Our main achievement
in this paper is to introduce a model of proteins which naturally associates a fatgraph 
G(P) to a protein P based upon the spatial locations of its constituent atoms. The idea is 
that the protein is roughly described geometrically as the concatenation of a sequence of 
planar polygons called "peptide units" meeting at tetrahedral angles at pairs of vertices 
and twisted by pairs of dihedral angles between the polygons. To each peptide unit, 
we associate a positively oriented orthonormal 3-frame and a fatgraph building block, 
and we concatenate these building blocks using these 3-frames in a manner naturally 
determined by the geometry of the Lie group SO(3). There are furthermore "hydrogen 
bonds" between atoms contained in the peptide units, and these are modeled by including 
further edges connecting the building blocks so as to determine a well-defined fatgraph 
G(P) from P. Thus, the fatgraph G(P) derived from the protein P captures the geometry 
of the protein "backbone" and the geometry and combinatorics of the hydrogen bonding 
along the backbone; elaborations of this basic model are also described which capture 
further aspects of protein structure. 
The key point is that topological or geometric properties of the fatgraph G(P) can
be taken as properties or "descriptors" of the protein P itself. A fundamental aspect not
usually relevant in applying fatgraphs is that this construction of G(P) is based on actual 
experimental data about P in which there are uncertainties and sometimes errors as well. 
Furthermore, the notion that the protein P is comprised of atoms at fixed relative spatial 
locations, which is the basic input to our model, is itself a biological idealization of the 
reality that a given protein at equilibrium may have several closely related co-existing 
geometric incarnations. In order that the protein descriptors arising from fatgraphs are 
meaningful characteristics of proteins in light of these remarks, we shall be forced to go 
beyond the usual situation and consider fatgraphs G whose corresponding surfaces F(G) 
are non-orientable. This is easily achieved combinatorially by including in the definition 
of a fatgraph also a coloring of its edges by a set with two elements. 

The desired result of "robust" protein descriptors, i.e., properties of G(P) that do not 
change much under small changes in the relative spatial locations of the atoms constituting 
P, is a key attribute of our construction; for example, the number of boundary components 
and the Euler characteristic of F(G(P)) are such robust invariants, and we give a plethora 
of further numerical and non-numerical examples. Another key point of our construction 
rests on the fact that biophysicists already often associate a graph to a protein P based 
upon its hydrogen and chemical bonding, and our model succeeds in reproducing this 
usual graphical depiction of a protein but now with its enhanced structure as a fatgraph 
G(P), i.e., the graph underlying G(P) is the one usually associated to P in biophysics. 
Furthermore, an important practical point is that fatgraphs can be conveniently stored 
and manipulated on the computer as triples of permutations. 

Since this is a math paper whose central purpose is to introduce fatgraph models of 
proteins, we shall not dwell on biophysical applications but nevertheless feel compelled to 
include here several such applications as follows. Certain proteins decompose naturally 
into "domains", roughly 115,000 of which have so far been determined experimentally
and categorized into several thousand classes, cf. [24] [22] [14] [11]. Our most basic robust
descriptors of a domain P are given by the topological types of the surface F(G(P))
computed with various thresholds of potential energy imposed on the hydrogen bonds
(see Section 3.4 for details). We show here that the topological types of F(G(P)) for
several such potential energy thresholds uniquely determine P among all known protein 
globules. Other such "injectivity results" for globules based on various robust protein 
descriptors are also presented. 

This paper is organized as follows. Section 1 introduces an abstract definition of
"polypeptides", which gives a precise mathematical formulation of the biophysics of a
protein required for our model; a more detailed discussion of proteins from first principles
is given in the beautiful book [9], which we heartily recommend. Section 2 introduces the
notion of fatgraphs required here, whose corresponding surfaces may be non-orientable, 
and contains basic results about them. In particular, a number of results, algorithms, 
and constructions are presented showing that our methods are amenable to fast computer 
implementation. 

Section 3 is the heart of the paper and describes the fatgraph associated to a polypeptide
structure in detail. Background on SO(3) graph connections is given in Section 3.1,
and this is applied in Section 3.2, where we explain how the fatgraph building blocks
associated with peptide units are concatenated. Section 3.3 discusses the addition of
edges corresponding to hydrogen bonds, thus completing the basic construction of the
fatgraph model of a polypeptide structure. Section 3.4 discusses this basic model and its
natural generalizations and extensions for proteins and beyond. An alternative description
of this model, which is more physically transparent but less mathematically tractable, is
given in Appendix A, and the standard structural motifs of "alpha helices" and "beta
strands" are discussed in this alternative model.
Robust invariants of fatgraphs are defined and studied in Section 4, providing countless
meaningful new protein descriptors. Section 5 gives the injectivity results mentioned
above after first discussing certain practical aspects of implementing our methods. Finally,
Section 6 contains closing remarks including several further biophysical applications of our
methods which will appear in companions and sequels to this paper.

1. Polypeptides 

There are 20 amino acids^1, 19 of which have the similar basic chemical structure
illustrated in Figure 1.1a, where H, C, N, O respectively denote hydrogen, carbon, nitrogen,
and oxygen atoms, and the residue R is one of 19 specific possible sub-molecules; the one
further amino acid, called Proline, has the related chemical structure containing a ring
CCCCN of atoms illustrated in Figure 1.1b. The residue ranges from a single hydrogen
atom for the amino acid called Glycine to a sub-molecule comprised of 19 atoms
for the amino acid called Tryptophan. All 20 amino acids are composed exclusively of
H, C, N, O atoms, except for the amino acids called Cysteine and Methionine, each of which
also contains a single sulfur atom.




a) Typical amino acid b) Proline 



Figure 1.1. Chemical structure of amino acids 

In either case of Figure 1.1, the sub-molecule COOH depicted on the right-hand side
is called the carboxyl group, and the NH2 depicted on the left-hand side in Figure 1.1a or
the NHC on the left-hand side in Figure 1.1b is called the amine group. The carbon atom
bonded to the carboxyl and amine groups is called the alpha carbon atom of the amino
acid, and it is typically denoted $C^\alpha$. The alpha carbon atom is bonded to exactly one
further atom in the residue, either a hydrogen atom in Glycine or a carbon atom, called
the beta carbon atom, in all other cases.

As illustrated in Figure 1.2, a sequence of L amino acids can combine to form a polypeptide,
where the carbon atom from the carboxyl group of the ith amino acid forms a peptide
bond with the nitrogen atom from the amine group of the (i + 1)st amino acid together
with the resulting condensation of a water molecule comprised of an OH from the carboxyl
group of the former and an H from the amine group of the latter, for i = 1, 2, ..., L - 1.
The nature of this peptide bond and the accuracy of the implied geometry of Figure 1.2
will be discussed presently, and the further notation in the figure will be explained later.

The primary structure of a polypeptide is the ordered sequence $R_1, R_2, \ldots, R_L$ of
residues or of amino acids occurring in this chain, i.e., a word of length L in the 20-letter
alphabet of amino acids, where L ranges in practice from L = 3 to L ≈ 30,000. The



1 Strictly speaking, these 20 molecules are the "standard gene-encoded" amino acids, i.e., those
amino acids determined from RNA via the genetic code; in fact, there are a few other non-standard
gene-encoded amino acids which are relatively rare in nature and which we shall ignore here.



R. C. PENNER, MICHAEL KNUDSEN, CARSTEN WIUF, AND JØRGEN ELLEGAARD ANDERSEN







Figure 1.2. A polypeptide 



carbon and nitrogen atoms which participate in the peptide bonds together with the alpha 
carbon atoms form the backbone of the polypeptide, which is described by 

$N_1 - C^\alpha_1 - C_1 - N_2 - C^\alpha_2 - C_2 - \cdots - N_i - C^\alpha_i - C_i - \cdots - N_L - C^\alpha_L - C_L$,

indicating the standard enumeration of atoms along the backbone. The first amine nitrogen
atom and the last carboxyl carbon atom, respectively, are called the N and C
termini of the polypeptide.

The ith peptide unit, for i = 1, 2, ..., L - 1, is comprised of the consecutively bonded
atoms $C^\alpha_i - C_i - N_{i+1} - C^\alpha_{i+1}$ in the backbone together with the oxygen atom $O_i$ from
the carboxyl group bonded to $C_i$ and one further atom, namely, the remaining hydrogen
atom $H_{i+1}$ of the amine group except for Proline, for which the further atom is the carbon
preceding the nitrogen of the amine group in the Proline ring.

This describes the basic chemical structure of a polypeptide, where the further physico- 
chemical details about residues, for example, can be found in any standard text and will 
not concern us here. 

There are several key geometrical facts about polypeptides as follows, where we refer 
to the center of mass of the Bohr model of a nucleus as the "center" of the atom and 
to the line segment connecting the centers of two chemically bonded atoms as the "bond
axis".

Fact 1.1. For any polypeptide, there are the following geometric constraints: 
Fact A each peptide unit is planar, i.e., the centers of the six constituent atoms of the 
peptide unit lie in a plane, and furthermore, the angles between the bond axes in a peptide 
unit are always fixed at 120 degrees; 

Fact B at each alpha carbon atom $C^\alpha_i$, the four bond axes (to hydrogen, $C_i$, $N_i$, and to
the residue, i.e., to the hydrogen atom of Glycine or to the beta carbon atom in all other
cases) are tetrahedral^2;

Fact C in the plane of each peptide unit, the centers of the two alpha carbons occur on op- 
posite sides of the line determined by the bond axis of the peptide bond, except occasionally 
for the peptide unit preceding Proline. 



2 Another geometric constraint on any gene-encoded protein is that when viewed along the bond axis
from hydrogen to $C^\alpha_i$, the bond axes occur in the cyclic ordering corresponding to $C_i$, residue, $N_i$. This
imposes various chiral constraints on proteins but plays no role in our basic fatgraph model.
We must remark immediately that these geometric facts are only effectively true, that
is, the peptide unit is almost planar and the angles between bond axes in a peptide unit
are nearly 120 degrees, for example, in Fact A; thus, the depiction in Figure 1.2 of the
peptide unit is nearly geometrically accurate. In nature, thermal and other fluctuations
do slightly affect the geometric absolutes stated in Fact 1.1, but we shall nevertheless take
these facts as geometric absolutes in constructing our model.

Fact A is fundamental to our constructions, and it arises from purely quantum effects:
the planar character is provided by the "$sp^2$ hybridization" of electrons in the $C_i$ and $N_{i+1}$
atoms in the ith peptide unit, and the peptide unit is rigid because of additional bonding
with $O_i$ of the two p-electrons from $C_i$, $N_{i+1}$ not involved in the $sp^2$ hybridization. This
complexity of shared electrons is why the peptide bond and the bond between $C_i$ and
$O_i$ are often drawn as "partial double bonds" as in Figure 1.2. In contrast, Fact B is
a standard consequence of the valence of carbon atoms in the Bohr model absent any
quantum mechanical hybridization of electrons.

As a point of terminology, Fact C expresses that except for Proline, the peptide unit
occurs in what is called the "trans-conformation", and the complementary possibility
(with the centers of the alpha carbon atoms in a peptide unit on the same side of the
line determined by the axis of the peptide bond) is called the "cis-conformation". This
geometric constraint follows from the simple fact that in the cis-conformation, the two
"large" alpha carbon atoms in the peptide unit would be so close together as to be
energetically unfavorable. In contrast for Proline, the two conformations are comparable
since in either case, two carbons (either the two alpha carbons or one alpha and the beta
carbon in the Proline ring) must be close together; nevertheless, cis-Proline, as opposed to
trans-Proline, occurs only about ten percent of the time in nature since the latter is still
somewhat energetically favorable. This exemplifies a general trend: somewhat energetically
unfavorable conformations do occur but more rarely than favorable ones, and very
energetically unfavorable conformations do not occur at all.

The mechanism underlying Fact C is that atoms cannot "bump into each other", or 
more precisely, their centers cannot be closer than their van der Waals radii allow, and 
this is called a steric constraint, which will be pertinent to subsequent discussions. 

Facts A and B together indicate the basic geometric structure of a polypeptide: a
sequence of planar peptide units meeting at tetrahedral angles at the alpha carbon atoms;
these planes can rotate rather freely about these tetrahedral bond axes, and
this accounts for the relative flexibility of polypeptides. For a polypeptide at equilibrium
in some environment, the dihedral angle along the bond axis of $N_i - C^\alpha_i$ (and $C^\alpha_i - C_i$)
between the bond axis of $C_{i-1} - N_i$ (and $N_i - C^\alpha_i$) and the bond axis of $C^\alpha_i - C_i$ (and
$C_i - N_{i+1}$) is called the conformational angle $\varphi_i$ (and $\psi_i$ respectively); see Figure 1.2.
Illustrating the physically possible pairs $(\varphi_i, \psi_i) \in S^1 \times S^1$, steric constraints for each
amino acid can be plotted in what is called a Ramachandran plot; in particular, for any
polypeptide at equilibrium in any environment, $\varphi_i$ is bounded away from zero because of
steric constraints involving $C_{i-1}$ and $C_i$.
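
As an illustration of how conformational angles are obtained from atomic coordinates, the following is a minimal sketch in Python, not part of the model itself. It assumes the usual atan2 formula for a signed dihedral angle about the middle bond of four consecutive atoms; the function names and the argument order are hypothetical choices made for this example, and the sign convention is simply the one produced by this formula.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees, in [-180, 180]) about the axis p1 -> p2,
    between the bond p0 - p1 and the bond p2 - p3."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals to the two half-planes
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))

def conformational_angles(c_prev, n, ca, c, n_next):
    """(phi_i, psi_i) from the centers of C_{i-1}, N_i, C^alpha_i, C_i, N_{i+1}."""
    phi = dihedral_degrees(c_prev, n, ca, c)   # axis N_i - C^alpha_i
    psi = dihedral_degrees(n, ca, c, n_next)   # axis C^alpha_i - C_i
    return phi, psi
```

Each point is a triple of Cartesian coordinates, for instance in Angstroms as recorded for an experimentally determined structure. A planar cis arrangement of four points gives an angle of 0 degrees and a planar trans arrangement gives 180 degrees in absolute value.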

This completes our discussion of the intrinsic physico-chemical and geometric aspects 
of polypeptides underlying our model. The remaining such aspect of importance to us 
depends critically upon the ambient environment in which the polypeptide occurs. 

An electronegative atom is one that tends to attract electrons, and examples of such 
atoms include C, N, O in this order of increasing such tendency. When an electronegative 
atom approaches another electronegative atom which is chemically bonded to a hydrogen 
atom, the two electronegative atoms can share the electron envelope of the hydrogen atom 
and attract one another through a hydrogen bond. A hydrogen bond has a well-defined 
potential energy determined on the basis of electrostatics which can be computed from the
spatial locations of its constituent atoms and the physical properties of its environment.

For example, the $O_i$ or $N_{i+1} - H_{i+1}$ in one peptide unit can form a hydrogen bond
with the $N_{j+1} - H_{j+1}$ or $O_j$ in another peptide unit, respectively, where $i \neq j$ owing to
rigidity and fixed lengths of 1.3-1.6 Angstroms of bond axes. For another example, many
of the remarkable properties of water arise from the occurrence of hydrogen bonds among 
HOH and OH2 molecules. The absolute potential energy of hydrogen bonds is rather 
large, so a polypeptide in a given environment seeks to saturate as many hydrogen bonds 
as possible subject to steric and other physico-chemical and geometric constraints. For 
example in an aqueous environment, the oxygen and nitrogen atoms in the peptide units 
of a polypeptide might form hydrogen bonds with one another or with the ambient water 
molecules of their environment, and there may also occur hydrogen bonding involving 
atoms comprising the residues or the alpha carbons. 

Suppose that a polypeptide is at equilibrium, i.e., at rest, in some environment. Its 
tertiary structure in that environment is the specification of the spatial coordinates of the 
centers of all of its constituent atoms. Furthermore, fix some energy cutoff and regard
a pair $O_i$ and $N_j$ of backbone atoms as being hydrogen bonded if the potential energy
discussed above is less than this energy cutoff; a standard convention is to take the energy
cutoff to be -0.5 kcal/mole^3. The secondary structure of the polypeptide at equilibrium
in an environment is the specification of hydrogen bonding as determined by an energy
cutoff among its constituent backbone atoms $O_i$ and $N_j$, for i, j = 1, 2, ..., L.

Certain polypeptides occur as the "proteins" which regulate and effectively define life as 
we know it. The collective knowledge of protein primary structures is deposited in the
manually curated SWISS-PROT data bank [2], which contains about 400,000 distinct entries,
and the computer curated UNI-PROT data bank [30], which contains about 6,000,000
entries. These data are readily accessible at www.ebi.ac.uk/swissprot and www.uniprot.org
respectively. The collective knowledge of protein tertiary structure is deposited in the
Protein Data Bank (PDB) [3], which contains roughly 55,000 proteins at this moment,
where the atomic locations of each of the constituent atoms of each of these proteins
are recorded; each entry in the PDB, i.e., each protein, thus comprises a vast amount of
data. Atomic locations in PDB should be taken with an experimental uncertainty of 0.2
Angstroms, and the conformational angles $\varphi$, $\psi$ computed from them should be taken with
an experimental uncertainty of 15-20 degrees; however, the unit displacement vectors of
bond axes along the backbone, upon which our model is based, are substantially better
determined [10].

Upon postulating definitions of the various secondary structure elements in terms of
properties of the atomic locations, protein secondary structure can be calculated from
tertiary structure. A standard such method is called the Dictionary of Secondary Structures
for Proteins (DSSP) [16], and proprietary software for these calculations and DSSP files
for each PDB entry can be found at swift.cmbi.ru.nl/gv/dssp. Hydrogen bond strengths
and various conformational angles are also output as part of the calculations of DSSP.
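
To make the energy cutoff concrete, here is a minimal Python sketch of the DSSP-style electrostatic energy between a backbone C=O group and a backbone N-H group. It assumes the standard expression $q_1 q_2 (1/r_{ON} + 1/r_{CH} - 1/r_{OH} - 1/r_{CN}) \cdot 332$ kcal/mole with $q_1 = 0.42$, $q_2 = 0.20$ and distances in Angstroms; the function names are hypothetical, and the placement of the hydrogen atom is taken as given rather than computed from idealized geometry.

```python
import math

Q1, Q2 = 0.42, 0.20   # partial charges in units of the electron charge
FACTOR = 332.0        # converts to kcal/mole when distances are in Angstroms
CUTOFF = -0.5         # energy cutoff in kcal/mole, the standard convention

def hbond_energy(c, o, n, h):
    """Electrostatic energy (kcal/mole) between the group C=O and the group N-H,
    given the coordinates of the four atom centers in Angstroms."""
    r_on = math.dist(o, n)
    r_ch = math.dist(c, h)
    r_oh = math.dist(o, h)
    r_cn = math.dist(c, n)
    return Q1 * Q2 * (1 / r_on + 1 / r_ch - 1 / r_oh - 1 / r_cn) * FACTOR

def is_hydrogen_bonded(c, o, n, h, cutoff=CUTOFF):
    """True if the energy is below the cutoff, i.e., the pair counts as bonded."""
    return hbond_energy(c, o, n, h) < cutoff
```

For a collinear test geometry with O at the origin, C at distance 1.23 behind it, H at distance 1.9 and N at distance 2.9 in front of it, the energy comes out near -2.9 kcal/mole, well below the cutoff; moving the N-H group far away drives the energy toward zero and the pair is no longer regarded as bonded.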



3 For instance in the standard method called DSSP [16], where $r_{XY}$ denotes the distance in
Angstroms between the centers of atoms $X, Y \in \{C, H, N, O\}$ and the location of H is determined from
idealized geometry and bond lengths in practice, the assignment of potential energy to the hydrogen bond
between O and NH in a water environment is given by
$q_1 q_2 \left( r_{ON}^{-1} + r_{CH}^{-1} - r_{OH}^{-1} - r_{CN}^{-1} \right) \cdot 332$ kcal/mole, where
$q_1 = 0.42$ and $q_2 = 0.20$ based on the respective assignment of partial charges -0.42e and +0.20e to
the carboxyl carbon and amine nitrogen with e representing the electron charge.

4 Other methods [18] [19] of determining hydrogen bonds are also employed.

5 This is a slight abuse of terminology as biologists might call this rather "super secondary structure";
we shall explain this distinction further when it is appropriate.
2. Fatgraphs



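2.1. Surfaces. According to the classification of surfaces [20], a compact and connected
surface F is uniquely determined up to homeomorphism by the specification of whether
it is orientable together with its genus g = g(F) and number r = r(F) of boundary
components, or equivalently, by either g or r and its Euler characteristic

$\chi(F) = 2 - 2g - r$ if F is orientable, and $\chi(F) = 2 - g - r$ if F is non-orientable.

Define the modified genus to be $g^* = g$ if F is orientable and $g^* = g/2$ if F is non-orientable,
so the formula $\chi = 2 - 2g^* - r$ holds in either case.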
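
The bookkeeping between genus, boundary components and Euler characteristic can be sketched in a few lines of Python. This is only an illustration under the standard conventions of the classification of surfaces, namely $\chi = 2 - 2g - r$ in the orientable case and $\chi = 2 - g - r$ in the non-orientable case, with the modified genus taken to be g or g/2 respectively; the function names are hypothetical.

```python
from fractions import Fraction

def euler_characteristic(orientable, g, r):
    """Euler characteristic of a compact connected surface of genus g with r
    boundary components (g counts handles if orientable, cross-caps otherwise)."""
    return 2 - 2 * g - r if orientable else 2 - g - r

def modified_genus(orientable, g):
    """Modified genus g*: equal to g if orientable and to g/2 otherwise."""
    return Fraction(g) if orientable else Fraction(g, 2)

def chi_from_modified_genus(g_star, r):
    """The uniform formula chi = 2 - 2 g* - r, valid in either case."""
    return 2 - 2 * g_star - r
```

For example, a torus with one boundary component has $\chi = -1$ and $g^* = 1$, while a Mobius band (non-orientable, g = 1, r = 1) has $\chi = 0$ and $g^* = 1/2$.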

Recall [20] that the orientation double cover of a surface F is the oriented surface $\tilde F$
together with the continuous map $p : \tilde F \to F$ so that for every point $x \in F$ there is a disk
neighborhood U of x in F, where $p^{-1}(U)$ consists of two components on each of which
p restricts to a homeomorphism and where the further restrictions of p to the boundary
circles of these two components give both possible orientations of the boundary circle of
U. Such a covering $p : \tilde F \to F$ always exists, and its properties uniquely determine $\tilde F$ up
to homeomorphism and p up to its natural equivalence. In particular, if F is connected
and orientable, then $\tilde F$ has two components with opposite orientations, each of which is
identified with F by p. Furthermore provided F is connected, F is non-orientable if and
only if $\tilde F$ is connected, and a closed curve in F lifts to a closed curve in $\tilde F$ if and only if a
neighborhood of it in F is homeomorphic to an annulus.

2.2. Fatgraphs and their associated surfaces. Consider a finite graph G in the usual
sense of the term comprised of vertices V = V(G) and edges E = E(G), which do not
contain their endpoints and where an edge is not necessarily uniquely determined by its
endpoints, or in other words, G is a finite one-dimensional CW complex. Our standard
notation will be v = v(G) = #V and e = e(G) = #E, where #X denotes the cardinality of
a set X. To avoid cumbersome cases in what follows, we shall assume that no component
of G consists of a single vertex or a single edge with distinct endpoints. Removing a
single point from each edge produces a subspace of G, each component of which is called
a half-edge. A half-edge which contains $u \in V$ in its closure is said to be incident on u,
and the number of distinct half-edges incident on u is the valence of u.

A fattening on G is the specification, for each $u \in V$, of a cyclic ordering on the half-edges
incident on u, and an X-coloring on G is a function $E \to X$, for any set X.

A fatgraph G is a graph endowed with a fattening together with a coloring by a set with
two elements, where we shall refer to the two colors on edges as "twisted" and "untwisted".
A fatgraph G uniquely determines a surface F(G) with boundary as follows.

Construction 2.1. For each vertex $u \in V$ of G of valence $k \geq 2$, consider an oriented
surface diffeomorphic to a polygon $P_u$ of 2k sides containing in its interior a single vertex
of valence k each of whose incident edges is also incident on a univalent vertex contained
in alternating sides of $P_u$, which are identified with the half-edges of G incident on u
so that the induced counter-clockwise cyclic ordering on the boundary of $P_u$ agrees with
the fattening of G about u; for a vertex u of valence k = 1, the corresponding surface
$P_u$ contains u in its boundary; see Figure 2.1. The surface F(G) is the quotient of the
disjoint union $\sqcup_{u \in V} P_u$, where the frontier edges, which are oriented with the polygons
on their left, are identified by a homeomorphism if the corresponding half-edges lie in a
common edge of G, and this identification of oriented segments is orientation-preserving
if and only if the edge is twisted. The graphs in the polygons $P_u$, for $u \in V$, combine to
Figure 2.1. The polygon $P_u$ associated with a vertex u



give a fatgraph embedded in F(G) with its univalent vertices in the boundary, which is
identified with G in the natural way so that we regard $G \subset F(G)$.

Our standard notation will be to set 

r(G) = r(F(G)) = the number of boundary components of F(G), 
g*(G) = g*(F(G)) = the modified genus of F(G). 

It is often convenient to regard a fatgraph more pictorially by considering the planar
projection of a graph embedded in three-space, where the cyclic ordering is given near
each vertex by the counter-clockwise ordering in the plane of projection and edges can
be drawn with arbitrary under/over crossings; we also depict untwisted edges as ordinary
edges and indicate twisted edges with an icon ×, or more generally, take this as defined
modulo two so that an even number of icons × represents an untwisted edge and an odd
number represents a twisted edge. Several examples of fatgraphs and their corresponding
surfaces are illustrated in Figure 2.2, where the bold lines indicate the planar projection of
the fatgraph, the dotted lines indicate the gluing along edges of polygons, and the further
notation in the figure will be explained later.




Figure 2.2. The surface associated to a fatgraph 



The graph G is evidently a strong deformation retract of F(G), so the Euler characteristic
is $\chi(F(G)) = \chi(G) = v(G) - e(G)$, and the boundary components of F(G) are
composed of the frontier edges of $\sqcup_{u \in V} P_u$ which do not correspond to half-edges of G.
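
The description of the boundary above suggests a simple way to count boundary components by computer. The following Python sketch is one possible implementation and is not taken from the paper: it assumes that each edge named x has the two half-edges (x, 0) and (x, 1), that each vertex is given by the counter-clockwise list of its incident half-edges, and that every half-edge has a clockwise side 0 and a counter-clockwise side 1. Boundary arcs join consecutive half-edges around a vertex and run along the two sides of each edge, crossing over when the edge is twisted; the boundary components are then the connected components of the resulting arcs, found here with a union-find structure.

```python
from fractions import Fraction

def boundary_components(vertices, twisted):
    """Count the boundary components r(G) of F(G).

    vertices: list of cyclic orderings, each a list of half-edges (edge, end)
              with end in {0, 1}, listed counter-clockwise.
    twisted:  dict mapping each edge name to True (twisted) or False.
    """
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent.setdefault(x, x)
        parent.setdefault(y, y)
        parent[find(x)] = find(y)

    # Corner arcs: the counter-clockwise side of h meets the clockwise
    # side of the next half-edge around the same vertex.
    for cyc in vertices:
        for i, h in enumerate(cyc):
            union((h, 1), (cyc[(i + 1) % len(cyc)], 0))
    # Edge arcs: sides are exchanged along an untwisted edge and
    # preserved along a twisted edge.
    for e, is_twisted in twisted.items():
        h, h2 = (e, 0), (e, 1)
        if is_twisted:
            union((h, 0), (h2, 0))
            union((h, 1), (h2, 1))
        else:
            union((h, 0), (h2, 1))
            union((h, 1), (h2, 0))
    return len({find(x) for x in parent})

def modified_genus(vertices, twisted):
    """Modified genus g* of F(G) for a connected fatgraph, from chi = 2 - 2 g* - r."""
    chi = len(vertices) - len(twisted)
    return Fraction(2 - chi - boundary_components(vertices, twisted), 2)
```

As a check, a single vertex with one untwisted loop gives an annulus (r = 2) and with one twisted loop gives a Mobius band (r = 1). The graph with two trivalent vertices joined by three untwisted edges a, b, c gives r = 3 when the cyclic orderings are (a, b, c) and (a, c, b), and r = 1 when both are (a, b, c); twisting the single edge a in the first case gives r = 2.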

Proposition 2.2. Suppose that G is a fatgraph and $X, Y \subset E(G)$ are disjoint collections
of edges. Change the color, twisted or untwisted, of the edges in X and delete from G
the edges in Y to produce another fatgraph G', whose cyclic orderings on half-edges are
induced from those on G in the natural way. Then $|r(G) - r(G')| \leq \#X + \#Y$.



FATGRAPH MODELS OF PROTEINS 






Proof. By the triangle inequality, it suffices to treat the case that $X \cup Y = \{f\}$, and we
set r = r(G). If $f \in E(G)$ is incident on a univalent vertex, then neither changing the
color of nor deleting f alters r, so we may assume that this is not the case. Consider
an arc a properly embedded in F(G) meeting f in a single transverse intersection and
otherwise disjoint from G. Rather than changing the color on f to produce G', let us
instead cut F(G) along a and then re-glue along the two resulting copies of a reversing
orientation to produce a surface homeomorphic to F(G'). If the endpoints of a occur in
a common boundary component of F(G), then the change of color on f either leaves r
invariant or increases it by one, and if they occur in different boundary components, then
the change of color on f necessarily decreases r by one. For the remaining case, rather
than removing the edge f to produce G', let us instead consider cutting F(G) along a
to produce a surface homeomorphic to F(G'). If the endpoints of a occur in the same
boundary component of F(G), then cutting on a either leaves r invariant or increases it
by one, and if they occur in different boundary components, then the cut on a decreases
r by one. □

We say that a fatgraph G is untwisted if all of its edges are untwisted, and this is 
evidently a sufficient but not a necessary condition for F(G) to be orientable. 

Remark 2.3. Suppose that G is an untwisted fatgraph. Let us emphasize that the genus 
of F(G) is not the classical genus of the underlying graph, i.e., the least genus orientable 
surface in which the underlying graph can be embedded. Rather, the classical genus of the 
underlying graph is the least genus of an orientable surface F(G) arising from all possible 
fattenings on the underlying graph. 

We say that two fatgraphs $G_1$ and $G_2$ are strongly equivalent if there is an isomorphism
of the graphs underlying $G_1$ and $G_2$ that respects the cyclic orderings and preserves the
coloring and that they are equivalent if there is a homeomorphism from $F(G_1)$ to $F(G_2)$
that maps $G_1 \subset F(G_1)$ to $G_2 \subset F(G_2)$. It is clear that strong equivalence implies
equivalence and that equivalence implies that the corresponding surfaces are homeomorphic;
neither converse holds in general.

Given a vertex u of G, define the vertex flip of G at u by reversing the cyclic ordering
on the half-edges incident on u and adding another icon × to each half-edge incident on
u. In particular, a vertex flip on a univalent vertex simply adds an icon × to the edge
incident upon it.
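
The vertex flip is easy to express on a stored fatgraph. The sketch below is illustrative only and uses a hypothetical data layout: a dictionary from vertex names to counter-clockwise lists of half-edges (edge, end), and a dictionary from edge names to a twist value in {0, 1}, so that adding an icon × is addition modulo two.

```python
def vertex_flip(vertices, twists, u):
    """Return the fatgraph obtained by a vertex flip at u.

    vertices: dict vertex -> list of half-edges (edge, end) in cyclic order.
    twists:   dict edge -> 0 (untwisted) or 1 (twisted).
    """
    new_vertices = dict(vertices)
    new_vertices[u] = list(reversed(vertices[u]))  # reverse the cyclic ordering
    new_twists = dict(twists)
    for edge, _end in vertices[u]:
        new_twists[edge] ^= 1                      # one more icon per incident half-edge
    return new_vertices, new_twists
```

In this encoding, flipping the same vertex twice returns the original fatgraph, and a loop based at u keeps its color under a flip at u because both of its half-edges receive an icon.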

Proposition 2.4. Two untwisted fatgraphs are equivalent if and only if they are strongly
equivalent. Two arbitrary fatgraphs $G_1$ and $G_2$ are equivalent if and only if there is a third
fatgraph G which arises from $G_1$ by a finite sequence of vertex flips so that G and $G_2$ are
strongly equivalent. In particular, if G arises from $G_1$ by a vertex flip, then G and $G_1$ are
equivalent.

Proof. In case $G_1$ and $G_2$ are untwisted, a homeomorphism from $F(G_1)$ to $F(G_2)$ mapping
$G_1$ to $G_2$ restricts to a strong equivalence of $G_1$ and $G_2$, and the converse follows by
construction in any case, as already observed, thus proving the first assertion.

The third assertion follows since a flip on a vertex u of $G_1$ corresponds to simply reversing
the orientation of the polygon $P_u$ in the construction of F(G), i.e., in our graphical
depiction, removing the neighborhood of u from the plane of projection, turning it upside
down in three-space, and then replacing it in the plane of projection at the expense
of twisting one further time each incident half-edge of G; this evidently extends to a
homeomorphism of $F(G_1)$ to F(G) which maps $G_1$ to G but does not preserve coloring.

Since strong equivalence implies equivalence by construction and equivalence of fatgraphs
is clearly a transitive relation, if there is such a fatgraph G as in the statement
of the proposition, then $G_1$ and $G_2$ are indeed equivalent. For the converse, we may and
shall assume that $G_1$ and $G_2$ are connected.
Consider a fatgraph G with v vertices and e edges, and choose a maximal tree T of
G. There are $1 - \chi(G) = 1 - v + e$ edges in G - T since we may collapse T to a point
without changing v - e, which is therefore the Euler characteristic of the collapsed graph
comprised of a single vertex and one edge for each edge of G - T.

We claim that there is a composition of flips of vertices in $G$ that results in a fatgraph
with any specified twisting on the edges in $T$. To see this, consider the collection of all
functions from the set of edges of $G$ to $\mathbb{Z}/2$, a set with cardinality $2^e$. Vertex flips act on
this set of functions in the natural way, and there are evidently $2^v$ possible compositions
of vertex flips. The simultaneous flip of all vertices of $G$ acts trivially on this set of
functions and corresponds to reversing the cyclic orderings at all vertices, so only $2^{v-1}$
such compositions may act non-trivially. Insofar as $2^e / 2^{v-1} = 2^{1-v+e}$ and there are
$1 - v + e$ edges of $G - T$ by the previous paragraph, the claim follows.
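
The counting in this claim can be checked by brute force on a small graph. The following Python
sketch is illustrative only: the graph (a 4-cycle with one chord) and the helper name
`reachable_twistings` are hypothetical, and a vertex flip is modeled, as in the proof, by toggling
the twist of every incident half-edge.

```python
from itertools import product


def reachable_twistings(num_vertices, edges):
    """Twist patterns (one bit per edge) reachable from the untwisted pattern.

    Flipping vertex u toggles each half-edge at u, so an edge (a, b) changes
    its twist once for each endpoint that is flipped (a loop changes twice).
    """
    patterns = set()
    for flips in product((0, 1), repeat=num_vertices):
        patterns.add(tuple((flips[a] + flips[b]) % 2 for a, b in edges))
    return patterns


# Hypothetical example: 4-cycle 0-1-2-3 plus the chord 0-2, so v = 4, e = 5.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
tree_edge_indices = [0, 1, 2]  # edges (0,1), (1,2), (2,3) form a maximal tree

orbit = reachable_twistings(4, edges)
tree_patterns = {tuple(p[i] for i in tree_edge_indices) for p in orbit}

print(len(orbit), len(tree_patterns))
```

In this example $2^{v-1} = 8$ patterns are reachable, and they restrict to all $2^3 = 8$ twistings of
the three tree edges, in agreement with the claim.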

Finally, suppose that $G_1$ and $G_2$ are equivalent and let $\phi : F(G_1) \to F(G_2)$ be a
homeomorphism of surfaces that restricts to a homeomorphism of $G_1$ to $G_2$. Performing
a vertex flip on $G_1$ and identifying edges before and after in the natural way produces a
fatgraph in which $T$ is still a maximal tree and which is again equivalent to $G_2$, according
to previous remarks, by a homeomorphism still denoted $\phi$, which maps $T$ to the maximal
tree $\phi(T) \subset G_2$. By the previous paragraph, we may apply a composition of vertex flips
to $G_1$ to produce a fatgraph $G$ so that an edge of the maximal tree $T \subset G$ is twisted if
and only if its image under $\phi$ is twisted.

Adding an edge of $G - T$ to $T$ produces a unique cycle in $G$, and a neighborhood of this
cycle in $F(G)$ is either an annulus or a Möbius band, with a similar remark for edges of
$G_2 - \phi(T)$. Since $\phi$ restricts to a homeomorphism of the corresponding annuli or Möbius
bands in $F(G)$ and $F(G_2)$, an edge of $G - T$ is twisted if and only if its image under $\phi$ is
twisted. It follows that $G$ and $G_2$ are strongly equivalent as desired. □



2.3. Fatgraphs and permutations. We shall adopt the standard notation for a permutation
on a set $S$, writing $(s_1, \ldots, s_k)$ for the cyclic permutation $s_1 \mapsto s_2 \mapsto \cdots \mapsto s_k \mapsto s_1$
on distinct elements $s_1, \ldots, s_k \in S$, called a transposition if $k = 2$, and shall compose
permutations $\sigma, \tau$ on $S$ from right to left, so that $\sigma \circ \tau(s) = \sigma(\tau(s))$. An involution is a
permutation $\tau$ so that $\tau \circ \tau = 1_S$, where $1_S$ denotes the identity map on $S$. Two
permutations are disjoint if they have disjoint supports, so disjoint permutations necessarily
commute.
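
These conventions can be made concrete in a few lines of Python. This is a minimal sketch, not
part of the paper: permutations of $\{1, \ldots, n\}$ are stored as dictionaries, and the helper
names `from_cycles`, `compose` and `is_involution` are our own.

```python
def from_cycles(cycles, n):
    """Permutation of {1, ..., n} as a dict, built from disjoint cycles."""
    perm = {s: s for s in range(1, n + 1)}
    for cycle in cycles:
        cycle = list(cycle)
        for a, b in zip(cycle, cycle[1:] + cycle[:1]):
            perm[a] = b  # the cycle (s1, ..., sk) sends s1 -> s2 -> ... -> sk -> s1
    return perm


def compose(sigma, tau):
    """Right-to-left composition: (sigma o tau)(s) = sigma(tau(s))."""
    return {s: sigma[tau[s]] for s in tau}


def is_involution(tau):
    return all(tau[tau[s]] == s for s in tau)


sigma = from_cycles([(1, 2, 3)], 3)
tau = from_cycles([(2, 3)], 3)
print(compose(sigma, tau))  # sigma o tau
print(compose(tau, sigma))  # tau o sigma, which differs since the supports overlap
```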

Fix a fatgraph G. A stub of G is a half-edge which is not incident on a univalent vertex 
of G. There are exactly two non-empty connected fatgraphs with no stubs, namely, the 
two we have proscribed consisting of a single vertex with no incident half-edges and a 
single edge with distinct endpoints. 

A fatgraph $G$ determines a triple $(\sigma(G), \tau_u(G), \tau_t(G))$ of permutations on its set $S =
S(G)$ of stubs as follows.

Construction 2.5. For each vertex $u$ of $G$ of valence $k(u) \geq 2$ with incident stubs $s_1, \ldots, s_{k(u)}$
in a linear ordering compatible with the cyclic ordering given by the fattening on $G$, consider
the cyclic permutation $(s_1, \ldots, s_{k(u)})$. By construction, the cyclic permutations
corresponding to distinct vertices of $G$ are disjoint. The composition

$$\sigma(G) = \prod_{\{\text{vertices } u \text{ of } G \,:\, u \text{ has valence} \geq 2\}} (s_1, \ldots, s_{k(u)})$$

is thus well-defined independent of the order in which the product is taken and likewise
for the compositions of transpositions

$$\tau_u(G) = \prod_{\{\text{pairs of distinct stubs } h, h' \text{ contained in some untwisted edge of } G\}} (h, h'),$$

$$\tau_t(G) = \prod_{\{\text{pairs of distinct stubs } h, h' \text{ contained in some twisted edge of } G\}} (h, h').$$

Notice that $\sigma(G)$ has no fixed points because we have taken the product over vertices
of valence at least two, and $\tau_u(G)$ and $\tau_t(G)$ are disjoint involutions whose common fixed
points are the stubs corresponding to the univalent vertices of $G$.

For example, enumerating the stubs of the fatgraphs $G_1, G_2, G_3$ as illustrated in
Figure 2.2, we have:

$\sigma(G_1) = \sigma(G_2) = \sigma(G_3) = (1, 2, 3)(4, 5, 6)(7, 8, 9)$,
$\tau_u(G_1) = (2, 8)(3, 6)(4, 7)(5, 9)$, $\tau_t(G_1) = 1_S$,
$\tau_u(G_2) = (2, 8)(3, 6)(4, 9)(5, 7)$, $\tau_t(G_2) = 1_S$,
$\tau_u(G_3) = (2, 8)(3, 6)(5, 9)$, $\tau_t(G_3) = (4, 7)$.
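
As an illustration, the three example triples can be stored and checked against the properties
noted above ($\sigma$ fixed point free, $\tau_u$ and $\tau_t$ disjoint involutions). The Python
sketch below is ours; the dictionary representation and the name `is_fatgraph_triple` are
assumptions made for the example.

```python
def from_cycles(cycles, n):
    """Permutation of {1, ..., n} as a dict, built from disjoint cycles."""
    perm = {s: s for s in range(1, n + 1)}
    for cycle in cycles:
        cycle = list(cycle)
        for a, b in zip(cycle, cycle[1:] + cycle[:1]):
            perm[a] = b
    return perm


def is_fatgraph_triple(sigma, tau_u, tau_t):
    """Check: sigma has no fixed points; tau_u, tau_t are disjoint involutions."""
    stubs = list(sigma)
    fixed_point_free = all(sigma[s] != s for s in stubs)
    involutions = all(
        tau_u[tau_u[s]] == s and tau_t[tau_t[s]] == s for s in stubs
    )
    disjoint = all(tau_u[s] == s or tau_t[s] == s for s in stubs)
    return fixed_point_free and involutions and disjoint


N = 9
SIGMA = from_cycles([(1, 2, 3), (4, 5, 6), (7, 8, 9)], N)
G1 = (SIGMA, from_cycles([(2, 8), (3, 6), (4, 7), (5, 9)], N), from_cycles([], N))
G2 = (SIGMA, from_cycles([(2, 8), (3, 6), (4, 9), (5, 7)], N), from_cycles([], N))
G3 = (SIGMA, from_cycles([(2, 8), (3, 6), (5, 9)], N), from_cycles([(4, 7)], N))

print([is_fatgraph_triple(*g) for g in (G1, G2, G3)])
```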

Remark 2.6. There is another treatment of fatgraphs as triples of permutations on the
set of all half-edges instead of stubs, where the univalent vertices are expressed as fixed
points of the analogue of $\sigma$. Moreover, there is a transposition in the analogue of $\tau_u \circ \tau_t$
corresponding to each half-edge, but the formulation we have given here, which treats
univalent vertices as "endpoints of half-edges rather than endpoints of edges", does not
require these additional transpositions. Since our model will have a plethora of univalent
vertices, we prefer the more "efficient" version described above, which is just a notational
convention for permutations.

Define a labeling on a fatgraph G with N stubs to be a linear ordering on its stubs, i.e., 
a bijection from the set of stubs of G to the set {1, 2, . . . , N}. 

Proposition 2.7. Fix some natural number $N \geq 2$. The map $G \mapsto (\sigma(G), \tau_u(G), \tau_t(G))$
of Construction 2.5 induces a bijection between the set of strong equivalence classes of
fatgraphs with $N$ stubs and the set of all conjugacy classes of triples $(\sigma, \tau_u, \tau_t)$ of permutations
on $N$ letters, where $\sigma$ is fixed point free and $\tau_u$ and $\tau_t$ are disjoint involutions.

Proof. The assignment $G \mapsto (\sigma(G), \tau_u(G), \tau_t(G))$ induces a mapping from the set of
labeled fatgraphs with $N$ stubs to the set of triples of permutations on $\{1, 2, \ldots, N\}$ in
the natural way. This induced mapping has an obvious two-sided inverse, where the
labeled fatgraph is constructed directly from the triple of permutations; we are here using
our convention that no component of $G$ is a single vertex or a single edge with distinct
univalent endpoints. A strong equivalence of labeled fatgraphs induces a bijection of
$\{1, 2, \ldots, N\}$ which conjugates their corresponding triples of permutations to one another
and conversely, so the result follows. □

Construction 2.8. Suppose that $G$ is a fatgraph with triple $(\sigma, \tau_u, \tau_t)$ of permutations
on its set $S$ of stubs determined by Construction 2.5. Construct a new set $\bar S = \{\bar s : s \in S\}$
and a new permutation $\bar\sigma$ on $\bar S$, where there is one $k$-cycle $(\bar s_k, \ldots, \bar s_1)$ in $\bar\sigma$ for each
$k$-cycle $(s_1, \ldots, s_k)$ in $\sigma$. Construct from $\tau_u$ a new permutation $\bar\tau_u$ on $\bar S$, where there is one
transposition $(\bar s_1, \bar s_2)$ in $\bar\tau_u$ for each transposition $(s_1, s_2)$ in $\tau_u$, and construct yet another
new permutation $\bar\tau_t$ on $S \cup \bar S$ from $\tau_t$, where there are two transpositions $(s_1, \bar s_2)$ and
$(\bar s_1, s_2)$ in $\bar\tau_t$ for each transposition $(s_1, s_2)$ in $\tau_t$. Finally, define permutations on $S \cup \bar S$ by

$$\sigma' = \sigma \circ \bar\sigma,$$

$$\tau' = \tau_u \circ \bar\tau_u \circ \bar\tau_t,$$

where the order of composition on the right-hand side is immaterial because the
permutations are disjoint in each case.
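
A minimal Python sketch of this construction follows. It is an illustration under our own
conventions, not the authors' implementation: permutations are dictionaries on the stubs
$1, \ldots, N$, the barred copy $\bar s$ is encoded as the negative integer $-s$, and the
function name `double_cover` is hypothetical. The example triple (one bivalent vertex carrying a
single twisted loop, so that the surface is a Möbius band) is also ours.

```python
def double_cover(sigma, tau_u, tau_t):
    """Return (sigma', tau') on S and its barred copy, with bar(s) encoded as -s."""
    sigma_p, tau_p = {}, {}
    for s in sigma:
        # sigma' agrees with sigma on S and runs each cycle backwards on the barred copy.
        sigma_p[s] = sigma[s]
        sigma_p[-sigma[s]] = -s
        if tau_t[s] != s:
            # A twisted edge (s, t) contributes (s, bar t) and (bar s, t).
            tau_p[s] = -tau_t[s]
            tau_p[-s] = tau_t[s]
        else:
            # An untwisted edge (or a fixed point) is simply copied to the barred set.
            tau_p[s] = tau_u[s]
            tau_p[-s] = -tau_u[s]
    return sigma_p, tau_p


# Hypothetical example: sigma = (1, 2), tau_u = identity, tau_t = (1, 2).
sigma = {1: 2, 2: 1}
tau_u = {1: 1, 2: 2}
tau_t = {1: 2, 2: 1}
sigma_p, tau_p = double_cover(sigma, tau_u, tau_t)
print(sigma_p)
print(tau_p)
```

Encoding bars as signs keeps the covering transformation $s \leftrightarrow \bar s$ available as
plain negation.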

Proposition 2.9. Suppose that Construction 2.5 assigns the triple $(\sigma, \tau_u, \tau_t)$ of permutations
to the fatgraph $G$ with set $S$ of stubs, let $\sigma', \tau'$ be determined from them according to
Construction 2.8, and consider the untwisted fatgraph $G'$ determined by Construction 2.5
from the triple $(\sigma', \tau', 1_{S \cup \bar S})$. Then $F(G')$ is the orientation double cover of $F(G)$, and the
covering transformation is described by $s \leftrightarrow \bar s$. In particular, provided $F(G)$ is connected,
$F(G')$ is connected if and only if $F(G)$ is non-orientable. Furthermore, there is a one-to-one
correspondence between the boundary components of $F(G')$ and the orientations on
the boundary components of $F(G)$, i.e., $F(G')$ has twice as many boundary components
as $F(G)$.

Proof. The surface $F(G')$ has the required properties of the orientation double cover
by construction, so the first two claims follow from the general principles articulated in
Section 2.1. Since each boundary component of $F(G)$ evidently has a neighborhood in
$F(G)$ homeomorphic to an annulus, the final assertion follows as well. □

Proposition 2.10. Adopt the hypotheses and notation of Proposition 2.9 and consider
the composition $\rho' = \sigma' \circ \tau'$.

i) The orientations on the boundary components of $F(G)$ are in one-to-one correspondence
with the cycles of $\rho'$. More explicitly, suppose that $s_1^1 s_1^2 s_2^1 s_2^2 \cdots s_n^1 s_n^2$ is the ordered sequence
of stubs traversed by an oriented edge-path in $G$ representing a boundary component of
$F(G)$ with some orientation, where $s_j^1, s_j^2$ are contained in a common edge of $G$ and perhaps
$s_j^1 = s_j^2$ if they are contained in an edge incident on a univalent vertex, for $j = 1, 2, \ldots, n$.
Erasing the bars on elements from the corresponding cycle of $\rho'$ produces the sequence
$(s_1^1, s_2^1, \ldots, s_n^1)$ of stubs of $G$ serially traversed by the corresponding oriented boundary
component of $F(G)$, called a reduced cycle of $\rho'$.

ii) There is the following algorithm to determine whether $G$ is connected in terms of the
associated triple $(\sigma, \tau_u, \tau_t)$ of permutations. For any linear ordering on $S$, let $X$ be the
subset of $S$ in the reduced cycle of $\rho'$ containing the first stub. (*) If $X = S$, then $G$ is
connected, and the algorithm terminates. If $X \neq S$, then consider the existence of a least
stub $s \in S - X$ so that $\tau(s) \in X$. If there is no such stub $s$, then $G$ is not connected,
and the algorithm terminates. If there is such a stub $s$, then update $X$ by adding to it the
subset of $S$ in the reduced cycle of $\rho'$ containing $s$. Go to (*).

Proof. Let us first consider the case that $\tau_t = 1_S$, i.e., $G$ is untwisted, and set $\tau = \tau_u$.

For the first part, consider a stub $s$ of $G$ and the effect of $\sigma \circ \tau$ on $s$. The stub $s$ is
contained in an edge incident on a univalent vertex if and only if $s$ is a fixed point of
$\tau$ by construction, and $\sigma(s) = \sigma(\tau(s))$ in this case is the stub following $s$ in the cyclic
ordering at the non-univalent endpoint of this edge. In the contrary case that $s$ is not a
fixed point of $\tau$, the stubs $s$ and $\tau(s)$ are half-edges contained in a common edge of $G$, and
$s, \tau(s), \rho(s) = \sigma(\tau(s))$ is likewise a consecutive triple of stubs occurring in an edge-path
of $G$ corresponding to a boundary component of $F(G)$ oriented with $F(G)$ on its left. It
follows that a cycle of $\sigma \circ \tau$ is comprised of every other stub traversed by an edge-path in
$G$ which corresponds to a boundary component of $F(G)$ oriented in this way, proving the
first part.

For the second part, the collection of stubs in $X$ always lies in a single component of
$G$ in light of the previous remarks, so if at some stage of the algorithm $X = S$, then $G$ is
indeed connected. If at some stage of the algorithm there is no stub $s \in S - X$ with $\tau(s) \in X$, then
$X$ is comprised of all the stubs in some component of $G$ in light of the previous discussion,
so $X \neq S$ in this case implies that $G$ has at least two components.

Turning now to the general case, $F(G')$ is the orientation double cover of $F(G)$, and
the induced projection map on stubs just erases the bars by Proposition 2.9. The proof
in this case is therefore entirely analogous. □
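
The algorithm of part ii) can be sketched in Python for the untwisted case treated first in the
proof, where $\tau = \tau_u$ and the relevant cycles are those of $\rho = \sigma \circ \tau$. This
is an illustrative sketch only: the function names are ours, and the disconnected example (two
bivalent vertices, each carrying two edges to univalent vertices) is hypothetical.

```python
def from_cycles(cycles, n):
    """Permutation of {1, ..., n} as a dict, built from disjoint cycles."""
    perm = {s: s for s in range(1, n + 1)}
    for cycle in cycles:
        cycle = list(cycle)
        for a, b in zip(cycle, cycle[1:] + cycle[:1]):
            perm[a] = b
    return perm


def cycle_through(perm, s):
    """The cycle of perm containing s, as a list starting at s."""
    cycle, t = [s], perm[s]
    while t != s:
        cycle.append(t)
        t = perm[t]
    return cycle


def is_connected_untwisted(sigma, tau):
    """Connectivity test of part ii), assuming the fatgraph is untwisted."""
    rho = {s: sigma[tau[s]] for s in sigma}
    stubs = sorted(sigma)
    X = set(cycle_through(rho, stubs[0]))
    while len(X) < len(stubs):
        # Least stub outside X whose partner under tau already lies in X.
        candidates = [s for s in stubs if s not in X and tau[s] in X]
        if not candidates:
            return False
        X |= set(cycle_through(rho, candidates[0]))
    return True


sigma1 = from_cycles([(1, 2, 3), (4, 5, 6), (7, 8, 9)], 9)
tau1 = from_cycles([(2, 8), (3, 6), (4, 7), (5, 9)], 9)  # the example G_1
sigma_two = from_cycles([(1, 2), (3, 4)], 4)              # hypothetical, two components
tau_two = from_cycles([], 4)

print(is_connected_untwisted(sigma1, tau1), is_connected_untwisted(sigma_two, tau_two))
```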

To exemplify these constructions and results for the fatgraphs illustrated in Figure 2.2,
we find

$\sigma(G_1) \circ \tau_u(G_1) = (5, 7)(3, 4, 8)(1, 2, 9, 6)$,

$\sigma(G_2) \circ \tau_u(G_2) = (1, 2, 9, 5, 8, 3, 4, 7, 6)$.

Thus, $r(G_1) = 3$ and $r(G_2) = 1$, and since $\chi(G_1) = \chi(G_2) = -1$, the (modified) genera
are $g^*(G_1) = 0$ and $g^*(G_2) = 1$.
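
These two computations are easy to reproduce. The Python sketch below is ours (the names
`cycles_of` and `untwisted_invariants` are hypothetical); it counts boundary components as cycles
of $\sigma \circ \tau_u$ and uses the formula $g^* = (2 - r - \chi)/2$ for the modified genus.

```python
from fractions import Fraction


def from_cycles(cycles, n):
    """Permutation of {1, ..., n} as a dict, built from disjoint cycles."""
    perm = {s: s for s in range(1, n + 1)}
    for cycle in cycles:
        cycle = list(cycle)
        for a, b in zip(cycle, cycle[1:] + cycle[:1]):
            perm[a] = b
    return perm


def cycles_of(perm):
    """All cycles of perm (including fixed points), each starting at its least element."""
    seen, cycles = set(), []
    for s in sorted(perm):
        if s in seen:
            continue
        cycle, t = [], s
        while t not in seen:
            seen.add(t)
            cycle.append(t)
            t = perm[t]
        cycles.append(cycle)
    return cycles


def untwisted_invariants(sigma, tau):
    """Return (r, chi, g*) for an untwisted fatgraph given by (sigma, tau)."""
    rho = {s: sigma[tau[s]] for s in sigma}
    r = len(cycles_of(rho))                      # boundary components
    vertices = len(cycles_of(sigma))             # non-univalent vertices
    edges = sum(1 for s in tau if tau[s] > s)    # one count per transposition
    chi = vertices - edges
    return r, chi, Fraction(2 - r - chi, 2)


SIGMA = from_cycles([(1, 2, 3), (4, 5, 6), (7, 8, 9)], 9)
TAU1 = from_cycles([(2, 8), (3, 6), (4, 7), (5, 9)], 9)
TAU2 = from_cycles([(2, 8), (3, 6), (4, 9), (5, 7)], 9)

print(untwisted_invariants(SIGMA, TAU1), untwisted_invariants(SIGMA, TAU2))
```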




Figure 2.3. Example of the orientation double cover

As to $G_3$, according to Construction 2.8 and Proposition 2.9, the permutations for the
orientation double cover are given by

$\sigma' = (1, 2, 3)(4, 5, 6)(7, 8, 9)(\bar 3, \bar 2, \bar 1)(\bar 6, \bar 5, \bar 4)(\bar 9, \bar 8, \bar 7)$,

$\tau' = (2, 8)(3, 6)(5, 9)(\bar 2, \bar 8)(\bar 3, \bar 6)(\bar 5, \bar 9)(4, \bar 7)(\bar 4, 7)$.

The untwisted fatgraph $G_3'$ corresponding to $(\sigma', \tau', 1_{S(G_3) \cup \bar S(G_3)})$ is illustrated in
Figure 2.3, and it is connected reflecting the fact that $F(G_3)$ is non-orientable. The cycles of
$\rho' = \sigma' \circ \tau'$ are given by

$(1, 2, 9, 6), (\bar 1, \bar 3, \bar 5, \bar 8)$, and $(\bar 2, \bar 7, 5, 7, \bar 6), (3, 4, \bar 9, \bar 4, 8)$

corresponding to the oriented boundary cycles of $G_3'$, and the reduced cycles of $\rho'$ are
therefore

$(1, 2, 9, 6), (1, 3, 5, 8)$, and $(2, 7, 5, 7, 6), (3, 4, 9, 4, 8)$,

each pair corresponding to the two orientations of a single boundary component of $F(G_3)$.
It follows that $r(G_3) = 2$ and thus $g^*(G_3) = 1/2$ since again $\chi(G_3) = -1$.

2.4. Fatgraphs on the computer. Given a linear ordering on the vertices of a fatgraph,
we may choose an a priori labeling on it that is especially convenient, where the stubs
about a fixed vertex are consecutive and the stubs about each vertex precede those of
each succeeding vertex as in Figure 2.2. Owing to Proposition 2.7, the strong equivalence
class of a fatgraph $G$ with set $S$ of stubs can conveniently be stored on the computer as
a triple $(\sigma, \tau_u, \tau_t)$ of permutations on the labels $\{1, 2, \ldots, \#S\}$ of stubs. The number of
non-univalent vertices of $G$ is the number of disjoint cycles in $\sigma$, the number of edges of
$G$ which are not incident on a univalent vertex is the number of disjoint transpositions in
$\tau_u \circ \tau_t$, and the Euler characteristic of $G$ or $F(G)$ is given by the former minus the latter.
Construction 2.8 provides an algorithm, which is easily implemented on the computer,
to produce a triple $(\sigma', \tau', 1_{S \cup \bar S})$ from $(\sigma, \tau_u, \tau_t)$ which determines an untwisted fatgraph
$G'$ whose corresponding surface $F(G')$ is the orientation double cover of $F(G)$ according
to Proposition 2.9. Proposition 2.10 provides an algorithm to determine the compatibly
oriented boundary components of $F(G')$ and hence the boundary components of $F(G)$



14 R. C. PENNER, MICHAEL KNUDSEN, CARSTEN WIUF, AND J0RGEN ELLEGAARD ANDERSEN 



itself, and Proposition 2.10 then gives an algorithm to determine whether $G'$ is connected
from this data, where both of these methods are again easily implemented on the computer.

In our applications of these techniques, the fatgraph $G$ will typically be connected as
we now assume. The orientability of $F(G)$ can thus be ascertained from the connectivity
of $F(G')$. The boundary components of $F(G)$, and their number $r$ in particular, can
be determined, as above, and hence the modified genus $g^* = (2 - r - \chi)/2$ is likewise
easily computed. Thus, the topological type of $F(G)$ can be determined algorithmically
on the computer from the triple $(\sigma, \tau_u, \tau_t)$ of permutations for a connected fatgraph $G$,
and the particular edge-paths in $G$ corresponding to boundary components of $F(G)$ can
be ascertained from the cycles of $\sigma' \circ \tau'$.
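
The whole pipeline of this section fits in a short Python sketch. The code is our illustration
under stated assumptions, not the authors' software: $G$ is assumed connected, $\bar s$ is
encoded as $-s$, the function names are hypothetical, and the connectivity of $G'$ is tested by a
plain orbit search under $\sigma'$ and $\tau'$ rather than by the algorithm of Proposition 2.10.

```python
from fractions import Fraction


def from_cycles(cycles, n):
    """Permutation of {1, ..., n} as a dict, built from disjoint cycles."""
    perm = {s: s for s in range(1, n + 1)}
    for cycle in cycles:
        cycle = list(cycle)
        for a, b in zip(cycle, cycle[1:] + cycle[:1]):
            perm[a] = b
    return perm


def count_cycles(perm):
    seen, count = set(), 0
    for s in perm:
        if s not in seen:
            count += 1
            while s not in seen:
                seen.add(s)
                s = perm[s]
    return count


def double_cover(sigma, tau_u, tau_t):
    """(sigma', tau') of Construction 2.8, with bar(s) encoded as -s."""
    sigma_p, tau_p = {}, {}
    for s in sigma:
        sigma_p[s], sigma_p[-sigma[s]] = sigma[s], -s
        if tau_t[s] != s:
            tau_p[s], tau_p[-s] = -tau_t[s], tau_t[s]
        else:
            tau_p[s], tau_p[-s] = tau_u[s], -tau_u[s]
    return sigma_p, tau_p


def topological_type(sigma, tau_u, tau_t):
    """Return (chi, orientable, r, g*) for a connected fatgraph."""
    edges = sum(1 for s in sigma if max(tau_u[s], tau_t[s]) > s)
    chi = count_cycles(sigma) - edges
    sigma_p, tau_p = double_cover(sigma, tau_u, tau_t)
    rho_p = {x: sigma_p[tau_p[x]] for x in tau_p}
    r = count_cycles(rho_p) // 2  # two orientations per boundary component
    # F(G) is non-orientable exactly when the double cover is connected.
    start = min(sigma)
    orbit, frontier = {start}, [start]
    while frontier:
        x = frontier.pop()
        for y in (sigma_p[x], tau_p[x]):
            if y not in orbit:
                orbit.add(y)
                frontier.append(y)
    orientable = len(orbit) < len(sigma_p)
    return chi, orientable, r, Fraction(2 - r - chi, 2)


SIGMA = from_cycles([(1, 2, 3), (4, 5, 6), (7, 8, 9)], 9)
IDENT = from_cycles([], 9)
G1 = (SIGMA, from_cycles([(2, 8), (3, 6), (4, 7), (5, 9)], 9), IDENT)
G2 = (SIGMA, from_cycles([(2, 8), (3, 6), (4, 9), (5, 7)], 9), IDENT)
G3 = (SIGMA, from_cycles([(2, 8), (3, 6), (5, 9)], 9), from_cycles([(4, 7)], 9))

for g in (G1, G2, G3):
    print(topological_type(*g))
```

Using `Fraction` keeps the half-integral modified genus of a non-orientable surface exact.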

3. The model 

We take as input to the method the following data specifying a polypeptide at equilibrium
in some environment:

Input i) the primary structure given as a sequence $R_i$ of letters in the 20-letter alphabet
of amino acids, for $i = 1, \ldots, L$;

Input ii) the specification of hydrogen bonding among the various nitrogen and oxygen
atoms $\{N_i, O_i : i = 1, \ldots, L\}$ described as a collection $B$ of pairs $(i, j)$ indicating that
$N_i - H_i$ is hydrogen bonded to $O_j$, where $i, j \in \{1, \ldots, L\}$;

Input iii) the displacement vectors $x_i$ from $C_i$ to $N_{i+1}$, $y_i$ from $C_i^\alpha$ to $C_i$, and $z_i$ from
$N_{i+1}$ to $C_{i+1}^\alpha$ in each peptide unit, for $i = 1, \ldots, L - 1$.

These data, which we shall term a polypeptide structure $P$, are either immediately given in
or readily derived from the PDB and DSSP files for a folded protein. Practical and other
details concerning the determination of these inputs will be discussed in Section 5.1.

A fatgraph is constructed from a polypeptide structure in two basic steps: modeling 
the backbone using the planarity of the peptide units and the conformational geometry 
along the backbone based on inputs i) and iii), and then adding edges to the model of the 
backbone for the hydrogen bonds based on inputs ii) and iii). 

We shall assume that input ii) is consistently based upon fixed energy thresholds with
each nitrogen or oxygen atom involved in at most one hydrogen bond (so-called "simple"
hydrogen bonding) and relegate the discussion of more general models (with so-called
"bifurcated" hydrogen bonding) to Section 3.4. The assumption thereby imposed on $B$ in
input ii) is that if $(i, j), (i', j') \in B$, then $i = i'$ if and only if $j = j'$.
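
The assumption on $B$ amounts to saying that no donor index and no acceptor index is repeated.
A small Python check, with a hypothetical function name and made-up index pairs, could read:

```python
def is_simple_bonding(bonds):
    """True if (i, j), (i', j') in B implies (i = i' if and only if j = j')."""
    pairs = set(bonds)  # B is a collection of pairs (i, j); duplicates are ignored
    donors = [i for i, _ in pairs]
    acceptors = [j for _, j in pairs]
    return len(set(donors)) == len(donors) and len(set(acceptors)) == len(acceptors)


print(is_simple_bonding({(5, 1), (6, 2)}))  # each atom in at most one bond
print(is_simple_bonding({(5, 1), (5, 2)}))  # N_5 - H_5 bonded twice (bifurcated)
```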

To each peptide unit is associated a fatgraph building block as illustrated in Figure 3.1.
These building blocks are concatenated to produce a model of the backbone as illustrated
in Figure 3.2, where the determination of whether the edge connecting the two building
blocks is twisted is based on input iii). Specifically, we shall associate to each peptide unit
a positively oriented orthonormal 3-frame determined from input iii). A pair of consecutive
peptide units thus gives a pair of such 3-frames, and there is a unique element of the Lie
group SO(3) mapping one to the other. Using this, we may assign an element of SO(3)
to each oriented edge of the graph underlying the fatgraph model and thereby determine
an "SO(3) graph connection" (cf. the next section) on the underlying graph, which is a
fundamental and independently interesting aspect of our constructions. This assignment
is discretized using the bi-invariant metric on SO(3) to determine twisting and define the
fatgraph model of the backbone, where there are special considerations to handle the case
of cis-Proline, which can be detected using inputs i) and iii).

Edges are finally added to this model of the backbone in the natural way, one edge for
each hydrogen bond in input ii); see Figure 3.4. These added edges for hydrogen bonds
may be twisted or untwisted, and this determination is again made by considering the
SO(3) graph connection.

Section 3.1 discusses generalities about 3-frames and SO(3) graph connections.
Section 3.2 details the concatenation of fatgraph building blocks to construct the model of the
backbone, and Section 3.3 explains the addition of edges corresponding to hydrogen bonds
thus completing the description of the basic model. The final Section 3.4 discusses the
general model with bifurcated hydrogen bonds plus other innovations and extensions of
the method. An alternative to the basic model, which gives an equivalent but not strongly
equivalent fatgraph that is arguably more natural than the basic model, is discussed in
Appendix A, and the standard motifs of polypeptide secondary structure are described in
the alternative model.

3.1. SO(3) graph connections and 3-frames. The Lie group SO(3) is the group of
three-by-three matrices $A$ whose entries are real numbers satisfying $A A^t = I$ and $\det(A) =
1$, where $A^t$ denotes the transpose of $A$ and $I$ denotes the identity matrix. A metric $d :
SO(3) \times SO(3) \to \mathbb{R}$ on SO(3) is said to be bi-invariant provided $d(CAD, CBD) = d(A, B)$
for any $A, B, C, D \in SO(3)$. The Lie group SO(3) supports the unique (up to scale)
bi-invariant metric

$$d(A, B) = -\tfrac{1}{2}\, \operatorname{trace}\big(\log(A B^t)\big)^2,$$

where the trace of a matrix is the sum of its diagonal entries and the logarithm is the 
matrix logarithm [6]. 

Proposition 3.1. For any $A_1, A_2 \in SO(3)$, we have $d(A_1, I) < d(A_2, I)$ if and only if
$\operatorname{trace}(A_2) < \operatorname{trace}(A_1)$, where $d$ is the unique bi-invariant metric on SO(3).

Proof. For any $A \in SO(3)$, there is $B \in SO(3)$ so that

$$B A B^{-1} = \begin{pmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

for some angle $0 \leq \theta \leq \pi$, cf. [6]. It follows from bi-invariance that

$$d(A, I) = d(B A B^t, B I B^t) = d(B A B^t, I) = d(B A B^{-1}, I),$$

i.e., distance to $I$ is a conjugacy invariant, and from the formula for $d$ that $d(A, I)$ is
a monotone increasing function of $\theta$. On the other hand, $\operatorname{trace}(A) = \operatorname{trace}(B A B^{-1}) =
\operatorname{trace}(B A B^t) = 1 + 2\cos\theta$ is a monotone decreasing function of $\theta$ which is also a conjugacy
invariant, and the result follows. □
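
In practice, the proposition means that distances to the identity can be compared without ever
computing a matrix logarithm. The following Python sketch is ours (helper names hypothetical): it
recovers the rotation angle from $\operatorname{trace}(A) = 1 + 2\cos\theta$ and compares two
rotations by trace alone.

```python
import math


def rotation_about_z(theta):
    """The normal form of the proof: rotation block for angle theta, fixing the third axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]


def trace(m):
    return m[0][0] + m[1][1] + m[2][2]


def rotation_angle(m):
    """Angle theta in [0, pi] with trace(m) = 1 + 2 cos(theta)."""
    cos_theta = (trace(m) - 1.0) / 2.0
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding


def closer_to_identity(a1, a2):
    """True if d(a1, I) < d(a2, I), decided via trace(a2) < trace(a1)."""
    return trace(a2) < trace(a1)


small, large = rotation_about_z(0.3), rotation_about_z(2.0)
print(rotation_angle(small), rotation_angle(large), closer_to_identity(small, large))
```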

A (positively oriented) 3-frame is an ordered triple $\mathcal{F} = (u_1, u_2, u_3)$ of three mutually
perpendicular unit vectors in $\mathbb{R}^3$ so that $u_3 = u_1 \times u_2$. For example, the standard unit
basis vectors $(i, j, k)$ give a standard 3-frame.

Proposition 3.2. An ordered pair $\mathcal{F} = (u_1, u_2, u_3)$ and $\mathcal{G} = (v_1, v_2, v_3)$ of 3-frames
uniquely determines an element $D \in SO(3)$, where $D u_i = v_i$, for $i = 1, 2, 3$. Furthermore,
the trace of $D$ is given by $u_1 \cdot v_1 + u_2 \cdot v_2 + u_3 \cdot v_3$, where $\cdot$ is the usual dot product of
vectors in $\mathbb{R}^3$.

Proof. Express

$$u_i = a_{1i}\, i + a_{2i}\, j + a_{3i}\, k,$$
$$v_i = b_{1i}\, i + b_{2i}\, j + b_{3i}\, k,$$

for $i = 1, 2, 3$, as linear combinations of $i, j, k$. The matrices $A = (a_{ij})$ and $B = (b_{ij})$ thus
map $i, j, k$ to $u_1, u_2, u_3$ and $v_1, v_2, v_3$, respectively. It follows that the matrix $D = B A^{-1}$
indeed has the desired properties. If $D'$ is another such matrix, then $D^{-1} D'$ must fix each
vector $u_1, u_2, u_3$, and hence must agree with the identity proving the first part. For the
second part since trace is a conjugacy invariant, we have

$$\operatorname{trace}(B A^{-1}) = \operatorname{trace}(A^{-1} B) = \operatorname{trace}(A^t B) = \sum_{i=1}^{3} u_i \cdot v_i$$

as was claimed. □
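
Since $A$ is orthogonal, $D = B A^{-1} = B A^t = \sum_i v_i u_i^t$, which is straightforward to
compute. The Python sketch below is illustrative only; the function names and the pair of frames
(the standard frame and its image under a quarter turn about $k$) are our own choices.

```python
def rotation_between(frame_f, frame_g):
    """The matrix D = sum_i v_i u_i^T, which satisfies D u_i = v_i for orthonormal frames."""
    return [
        [sum(v[r] * u[c] for u, v in zip(frame_f, frame_g)) for c in range(3)]
        for r in range(3)
    ]


def apply(m, vec):
    return tuple(sum(m[r][c] * vec[c] for c in range(3)) for r in range(3))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


F = ((1, 0, 0), (0, 1, 0), (0, 0, 1))    # standard 3-frame (i, j, k)
G = ((0, 1, 0), (-1, 0, 0), (0, 0, 1))   # (j, -i, k), again positively oriented

D = rotation_between(F, G)
trace_D = D[0][0] + D[1][1] + D[2][2]
dot_sum = sum(dot(u, v) for u, v in zip(F, G))
print(D, trace_D, dot_sum)
```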
Suppose that $\Gamma$ is a graph. An SO(3) graph connection on $\Gamma$ is the assignment of an
element $A_f \in SO(3)$ to each oriented edge $f$ of $\Gamma$ so that the matrix associated to the
reverse of $f$ is the transpose of $A_f$. Two such assignments $A_f$ and $B_f$ are regarded as
equivalent if there is an assignment $C_u \in SO(3)$ to each vertex $u$ of $\Gamma$ so that $A_f =
C_u B_f C_w^{-1}$, for each oriented edge $f$ of $\Gamma$ with initial point $u$ and terminal point $w$. An
SO(3) graph connection on $\Gamma$ determines an isomorphism class of flat principal SO(3)
bundles over $\Gamma$, cf. [8]. Given an oriented edge-path $\gamma$ in $\Gamma$ described by consecutive
oriented edges $f_0 - f_1 - \cdots - f_{k+1}$, where the terminal point of $f_i$ is the initial point of
$f_{i+1}$, for $i = 0, \ldots, k$, the parallel transport operator of the SO(3) graph connection along
$\gamma$ is given by the matrix product $\rho(\gamma) = A_{f_0} A_{f_1} \cdots A_{f_{k+1}} \in SO(3)$. In particular, if the
terminal point of $f_{k+1}$ agrees with the initial point of $f_0$ so that $\gamma$ is a closed oriented
edge-path, then $\operatorname{trace}(\rho(\gamma))$ is the holonomy of the graph connection along $\gamma$ and is well-defined
on the equivalence class of graph connections.

3.2. Modeling the backbone. In this section, we shall define our model $T(P)$ for the
backbone of a polypeptide structure $P$. To this end, consider the fatgraph building block
depicted in Figure 3.1, which consists of a horizontal segment and two vertical segments
joined to distinct interior points of the horizontal segment, the vertical segment on the left
lying above and on the right below the horizontal segment. Each such building block
represents a peptide unit, as is also indicated in the figure, where the left and right endpoints
of the horizontal segment represent $C_i^\alpha$ and $C_{i+1}^\alpha$ and are labeled by the corresponding
residue $R_i$ and $R_{i+1}$, respectively, the left and right trivalent vertices represent $C_i$ and
$N_{i+1}$, respectively, and the endpoints of the vertical segments above and below the
horizontal segment represent $O_i$ and $H_{i+1}$, respectively, or in the case that $R_{i+1}$ is Proline,
the endpoint of the vertical segment below the horizontal segment instead represents the
non-alpha carbon atom bonded to $N_{i+1}$ in the Proline ring. In the case of cis-Proline, a
more geometrically accurate building block would have the vertical segment on the right
also lying above the horizontal segment as indicated by the skinny line in the figure, but
we nevertheless use a single building block in all cases for convenience.



Figure 3.1. Fatgraph building block



Fix a polypeptide structure $P$ and start by defining a fatgraph $T'(P)$ as the concatenation
of $L - 1$ copies of this fatgraph building block, where the two univalent vertices
representing $C_{i+1}^\alpha$ are identified so that the two incident edges are combined to form a
single horizontal edge of $T'$ called the $(i + 1)$st alpha carbon linkage, for $i = 1, \ldots, L - 2$,
as illustrated in Figure 3.2; let us also refer to the horizontal edges incident on the vertex
corresponding to $C_1^\alpha$ and $C_L^\alpha$ as the first and $L$th alpha carbon linkages, respectively, so
the $i$th alpha carbon linkage is naturally labeled by the amino acid $R_i$, for $i = 1, \ldots, L$.
Thus, $T'(P)$ consists of a long horizontal segment composed of $2L - 1$ horizontal edges,
$L$ of which are alpha carbon linkages and $L - 1$ of which correspond to peptide bonds,
with $2L - 2$ short vertical edges attached to it alternately lying above and below the long
horizontal segment. We shall define the fatgraph $T(P)$ by specifying twisting on the alpha
carbon linkages of $T'(P)$.

Figure 3.2. Concatenating fatgraph building blocks



Construction 3.3. Associate a 3-frame $\mathcal{F}_i = (u_i, v_i, w_i)$ to each peptide unit using input
iii) by setting

$$u_i = \frac{1}{|x_i|}\, x_i,$$

$$v_i = \frac{1}{|y_i - (u_i \cdot y_i)\, u_i|}\, \big(y_i - (u_i \cdot y_i)\, u_i\big),$$

$$w_i = u_i \times v_i,$$

for $i = 1, \ldots, L - 1$, where $|t|$ denotes the norm of the vector $t$.

Thus, $u_i$ is the unit displacement vector from $C_i$ to $N_{i+1}$, $v_i$ is the projection of $y_i$ onto
the specified perpendicular of $u_i$ in the plane of the peptide unit, and $w_i$ is the specified
normal vector to this plane.
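
This is a Gram-Schmidt step followed by a cross product. A minimal Python sketch, with
hypothetical helper names and made-up displacement vectors rather than coordinates from a real
structure, is:

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def unit(a):
    norm = math.sqrt(dot(a, a))
    return tuple(c / norm for c in a)


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def peptide_frame(x, y):
    """3-frame (u, v, w) from the displacement vectors x (C_i to N_{i+1}) and y (C_i^alpha to C_i)."""
    u = unit(x)
    along_u = dot(u, y)
    v = unit(tuple(yc - along_u * uc for yc, uc in zip(y, u)))  # part of y perpendicular to u
    w = cross(u, v)
    return u, v, w


# Made-up input vectors, chosen so that the resulting frame is the standard one.
u, v, w = peptide_frame((2.0, 0.0, 0.0), (1.0, 3.0, 0.0))
print(u, v, w)
```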

According to Proposition 3.2, there is a unique element $A_i \in SO(3)$ mapping $\mathcal{F}_i$ to
$\mathcal{F}_{i+1}$, for $i = 1, \ldots, L - 2$. Define the backbone graph connection on the graph underlying
$T'(P)$ to take value $I$ on all oriented edges except on the $i$th alpha carbon linkage oriented
from its endpoint representing $N_i$ to its endpoint representing $C_i$, where it takes value $A_{i-1}$, for
$i = 2, \ldots, L - 1$.

We shall discretize the backbone graph connection to finally define the backbone fatgraph
model $T(P)$. To this end, in addition to the 3-frames $\mathcal{F}_i = (u_i, v_i, w_i)$ of
Construction 3.3, we consider also the 3-frames $\mathcal{G}_i = (u_i, -v_i, -w_i)$, which correspond to simply
turning $\mathcal{F}_i$ upside down by rotating through 180 degrees in three-space about the line
containing $C_i$ and $N_{i+1}$, for $i = 1, \ldots, L - 1$. Again, by the first part of Proposition 3.2,
there is a unique element $B_i \in SO(3)$ taking $\mathcal{F}_i$ to $\mathcal{G}_{i+1}$. By construction, $A_i$ also takes
$\mathcal{G}_i$ to $\mathcal{G}_{i+1}$, and $B_i$ takes $\mathcal{G}_i$ to $\mathcal{F}_{i+1}$.

Construction 3.4. For any polypeptide structure $P$, define the fatgraph $T(P)$ derived
from $T'(P)$ by taking twisting only on certain of the alpha carbon linkages, where the
$(i + 1)$st alpha carbon linkage is twisted if and only if

$d(I, B_i) < d(I, A_i)$, if $R_{i+1}$ is not cis-Proline;
$d(I, B_i) > d(I, A_i)$, if $R_{i+1}$ is cis-Proline,

for $i = 1, \ldots, L - 2$, where $d$ is the unique bi-invariant metric on SO(3).

Corollary 3.5. The $(i + 1)$st alpha carbon linkage of the backbone model $T(P)$ is twisted
if and only if

$v_i \cdot v_{i+1} + w_i \cdot w_{i+1} < 0$, if $R_{i+1}$ is not Proline or $y_i \cdot z_i > 0$;
$v_i \cdot v_{i+1} + w_i \cdot w_{i+1} > 0$, if $R_{i+1}$ is Proline and $y_i \cdot z_i < 0$,

for $i = 1, \ldots, L - 2$.
Proof. According to Proposition 3.1, $d(A_i, I) < d(B_i, I)$ if and only if $\operatorname{trace}(B_i) < \operatorname{trace}(A_i)$.

According to the second part of Proposition 3.2, we have

$\operatorname{trace}(A_i) = u_i \cdot u_{i+1} + v_i \cdot v_{i+1} + w_i \cdot w_{i+1}$,
$\operatorname{trace}(B_i) = u_i \cdot u_{i+1} - v_i \cdot v_{i+1} - w_i \cdot w_{i+1}$,

so that $\operatorname{trace}(A_i) - \operatorname{trace}(B_i) = 2(v_i \cdot v_{i+1} + w_i \cdot w_{i+1})$.

Thus, if $R_{i+1}$ is not Proline, then we twist the $(i + 1)$st alpha carbon linkage if and
only if $\mathcal{F}_i$ is closer to $\mathcal{G}_{i+1}$ than it is to $\mathcal{F}_{i+1}$ in the sense that $d(I, A_i) > d(I, B_i)$, and this
is our natural discretization of the backbone graph connection in Construction 3.4 in this
case. If $R_{i+1}$ is Proline, then it is in the cis-conformation if and only if $y_i \cdot z_i < 0$, so we
twist the $(i + 1)$st alpha carbon linkage for cis-Proline only if $d(I, A_i) < d(I, B_i)$. To see
that this is the natural discretization of the backbone graph connection in this case, notice
that the 3-frame $\mathcal{F}_i$ in Construction 3.3 is determined using the displacement vectors $x_i$
from $C_i$ to $N_{i+1}$ and $y_i$ from $C_i^\alpha$ to $C_i$, which are insensitive to whether $R_{i+1}$ is in the
cis-conformation. It is therefore only upon exiting a cis-Proline along the backbone that
the earlier determination should be modified since the latter displacement vector should
be replaced by its antipode. □



Figure 3.3. Level sets of $\operatorname{trace}(A) - \operatorname{trace}(B)$ on a Ramachandran plot

Define the flip sequence of G(P) to be the word in the alphabet {F, N} whose ith letter 
is N if and only if the (i + l)st alpha carbon linkage is untwisted, for i = 1, . . . , L(G) — 2. 
The flip sequence thus gives a discrete invariant assigned to each alpha carbon linkage 
derived from the conformational geometry along the backbone. The flip sequence can be 
determined directly from the conformational angles along the backbone using the following 
result. 
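
Given the 3-frames of consecutive peptide units, the flip sequence also follows directly from the
sign test of Corollary 3.5. The Python sketch below is an illustration with assumed inputs: the
function name is ours, `frames` lists the triples $(u_i, v_i, w_i)$ in backbone order, and
`is_cis` is a precomputed list of booleans recording, for each consecutive pair, whether
$R_{i+1}$ is Proline with $y_i \cdot z_i < 0$.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def flip_sequence(frames, is_cis):
    """Word in {F, N}: letter i is F when the (i+1)st alpha carbon linkage is twisted."""
    letters = []
    for (_, v, w), (_, v_next, w_next), cis in zip(frames, frames[1:], is_cis):
        s = dot(v, v_next) + dot(w, w_next)  # half of trace(A_i) - trace(B_i)
        twisted = s > 0 if cis else s < 0
        letters.append("F" if twisted else "N")
    return "".join(letters)


# Made-up frames: the standard frame followed by two copies of the flipped frame.
standard = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
flipped = ((1, 0, 0), (0, -1, 0), (0, 0, -1))
frames = [standard, flipped, flipped]

print(flip_sequence(frames, [False, False]))
print(flip_sequence(frames, [True, True]))
```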
Proposition 3.6. Under the idealized geometric assumptions of tetrahedral angles among
bonds at each alpha carbon atom and 120-degree angles between bonds within a peptide unit,
the matrix $A = A_i$ in Construction 3.4 can be calculated in terms of the conformational
angles $\varphi = \varphi_i$ and $\psi = \psi_i$ as follows:

Explicitly, this is the representative $A = A_i$ in its conjugacy class for which the 3-frame
vectors $u_i = i$, $v_i = j$, $w_i = k$ in Construction 3.3 are given by the standard unit basis
vectors.

Proof. Let $\xi$ be an angle and $v$ be a non-zero vector in $\mathbb{R}^3$. We denote by $(\xi, v)$ the linear
transformation $\mathbb{R}^3 \to \mathbb{R}^3$ which rotates $\mathbb{R}^3$ through the angle $\xi$ around the line spanned
by $v$ in the right-handed sense in the direction of $v$. By following the standard 3-frame
along the backbone in the natural way one bond at a time, we find

$$A = B_6(\varphi, \psi)\, B_5(\varphi, \psi)\, B_4(\varphi, \psi)\, B_3(\varphi)\, B_2(\varphi)\, B_1(\pi/3),$$

where

$$B_1(\xi) = (\xi, k), \quad B_2(\varphi) = (\varphi, B_1(\pi/3)\, i), \quad B_3(\varphi) = (\pi - \theta, B_2(\varphi)\, k),$$
$$B_4(\varphi, \psi) = (\psi, B_3(\varphi) B_1(\pi/3)\, i), \quad B_5(\varphi, \psi) = (2\pi/3, -B_4(\varphi, \psi) B_3(\varphi) B_2(\varphi)\, k),$$
$$B_6(\varphi, \psi) = (\pi, B_5(\varphi, \psi) B_4(\varphi, \psi) B_3(\varphi) B_2(\varphi) B_1(\pi/3)\, j),$$

and where $\theta = 2 \arctan(\sqrt{2})$ is the tetrahedral angle $\approx 109.5$ degrees, for which $\cos\theta = -1/3$.
We observe that

$$B_4(\varphi, \psi)\, B_3(\varphi) = B_3(\varphi)\, B_2(\psi),$$

whence

$$B_4(\varphi, \psi)\, B_3(\varphi)\, B_2(\varphi) = B_3(\varphi)\, B_2(\varphi + \psi),$$

and therefore

$$A = B_6(\varphi, \psi)\, B_3(\varphi)\, B_2(\varphi + \psi)\, B_1(-\pi/3).$$

Setting $B_0 = (\pi, j)$, we conclude

$$A = B_3(\varphi)\, B_2(\varphi + \psi)\, B_1(-\pi/3)\, B_0,$$

which devolves after some computation to the given expression. □
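
The reduction from six factors to four can be checked numerically. The Python sketch below is
our own check, not the authors' code: it assumes the rotations $B_1, \ldots, B_6$ exactly as
written in the proof, takes $B_0$ to be the rotation by $\pi$ about $j$, and uses Rodrigues'
formula for a rotation about an arbitrary axis. It verifies only this algebraic reduction, not the
closed-form matrix of the proposition.

```python
import math

I_AXIS, J_AXIS, K_AXIS = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
THETA = 2.0 * math.atan(math.sqrt(2.0))  # tetrahedral angle, cos(THETA) = -1/3


def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)] for r in range(3)]


def apply(m, vec):
    return [sum(m[r][k] * vec[k] for k in range(3)) for r in range(3)]


def rotation(angle, axis):
    """Right-handed rotation by `angle` about `axis` (Rodrigues' formula)."""
    norm = math.sqrt(sum(c * c for c in axis))
    x, y, z = (c / norm for c in axis)
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + x * x * t, x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]


def six_factor_product(phi, psi):
    b1 = rotation(math.pi / 3, K_AXIS)
    b2 = rotation(phi, apply(b1, I_AXIS))
    b3 = rotation(math.pi - THETA, apply(b2, K_AXIS))
    b4 = rotation(psi, apply(matmul(b3, b1), I_AXIS))
    m = matmul(b4, matmul(b3, b2))
    b5 = rotation(2 * math.pi / 3, [-c for c in apply(m, K_AXIS)])
    n = matmul(b5, matmul(m, b1))
    b6 = rotation(math.pi, apply(n, J_AXIS))
    return matmul(b6, n)


def four_factor_product(phi, psi):
    axis2 = apply(rotation(math.pi / 3, K_AXIS), I_AXIS)
    b3 = rotation(math.pi - THETA, apply(rotation(phi, axis2), K_AXIS))
    tail = matmul(rotation(-math.pi / 3, K_AXIS), rotation(math.pi, J_AXIS))
    return matmul(b3, matmul(rotation(phi + psi, axis2), tail))


def max_difference(a, b):
    return max(abs(a[r][c] - b[r][c]) for r in range(3) for c in range(3))


print(max_difference(six_factor_product(-1.0, 2.3), four_factor_product(-1.0, 2.3)))
```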

Remark 3.7. It is interesting to graph the level sets of $\operatorname{trace}(A) - \operatorname{trace}(B)$ on the
Ramachandran plot, i.e., the plot of pairs of conformational angles $(\varphi_i, \psi_i)$, for the entire
CATH database [24] using Proposition 3.6 as depicted in Figure 3.3, where the matrix
$B = B_i$ of Construction 3.3 is obtained from $A = A_i$ in Proposition 3.6 by pre-composing
it with rotation by $\pi$ about $i$. In particular, the zero level set fairly well avoids highly
populated regions, so the case of near equality in Construction 3.4 is a relatively rare
phenomenon for proteins.¹

3.3. Modeling hydrogen bonds. The fatgraph model $T(P)$ of the backbone of a polypeptide
structure $P$ defined in the previous section is here completed to our fatgraph model
$G(P)$. Just as in the previous section, we shall first define another fatgraph $G'(P)$ from
which $G(P)$ is derived by further twisting certain of its edges. As described in the previous
section, $T(P)$ consists of a long horizontal segment, certain of whose alpha carbon linkages
are twisted, together with small vertical segments alternately lying above and below
the long horizontal segment, where the $(i + 1)$st alpha carbon linkage is labeled by its
corresponding amino acid $R_{i+1}$, for $i = 1, \ldots, L$. The endpoints of the vertical segments
above and below the horizontal segment respectively represent the atoms $O_i$ and $H_{i+1}$
except for the vertical segments below the horizontal segment preceding an alpha carbon
linkage labeled by Proline, whose endpoint represents the non-alpha carbon atom bonded
to $N_{i+1}$ in the corresponding Proline ring, for $i = 1, \ldots, L - 1$.




Figure 3.4. Adding edges to T(P) for hydrogen bonds 

Construction 3.8. For each $(i, j) \in B$ in input ii), adjoin an edge to $T(P)$ without
introducing new vertices connecting the endpoints of short vertical segments corresponding
to $H_i$ and $O_j$ to produce a fatgraph denoted $G'(P)$.

See Figure 3.4. It is important to emphasize that the relative positions of these added edges
corresponding to hydrogen bonds other than their endpoints are completely immaterial to
the strong equivalence class of $G'(P)$. The edges of $T(P)$ corresponding to the non-alpha
carbon atoms in Proline rings are never hydrogen bonded in our model.

To complete the construction of G(P), it remains only to determine which edges of the
fatgraph G'(P) are twisted. To this end, suppose that $(i,j) \in B$ in input ii). According to
our enumeration of peptide units, $H_i$ occurs in peptide unit $i-1$ and $O_j$ occurs in peptide
unit $j$, and there are corresponding 3-frames

$\mathcal{F}_{i-1} = (\vec{u}_{i-1}, \vec{v}_{i-1}, \vec{w}_{i-1})$,

$\mathcal{G}_j = (\vec{u}_j, -\vec{v}_j, -\vec{w}_j)$,

from Construction 3.3.



¹ Indeed, further scrutiny of detail in Figure 3.3, which is not depicted, shows that the zero level set
does penetrate into conformations of "beta turns of types II and VI", cf. the discussion of Figure A.3.

Construction 3.9. As before, by the first part of Proposition 3.2, there are unique
$D_{i,j}, E_{i,j} \in SO(3)$ taking $\mathcal{F}_{i-1}$ to $\mathcal{F}_j, \mathcal{G}_j$ respectively. An edge of G'(P) corresponding to
the hydrogen bond $(i,j) \in B$ is twisted in G(P) if and only if

$d(I, E_{i,j}) < d(I, D_{i,j})$,

where d is the unique bi-invariant metric on SO(3).

As before, a short computation gives: 

Corollary 3.10. The edge of G(P) corresponding to the hydrogen bond $(i,j) \in B$ is
twisted if and only if $\vec{v}_{i-1} \cdot \vec{v}_j + \vec{w}_{i-1} \cdot \vec{w}_j < 0$.
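The equivalence between the metric criterion of Construction 3.9 and the dot-product criterion of Corollary 3.10 can be checked numerically. The following is only a minimal sketch: the function names are hypothetical, frames are assumed to be given as orthonormal, positively oriented triples (u, v, w), and it uses the standard fact that the distance from the identity in the bi-invariant metric on SO(3) is the rotation angle, which decreases as the trace increases.

```python
from typing import List, Tuple

Vec = List[float]
Frame = Tuple[Vec, Vec, Vec]  # (u, v, w): assumed orthonormal, positively oriented


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def flipped(frame: Frame) -> Frame:
    """Return (u, -v, -w) for a frame (u, v, w)."""
    u, v, w = frame
    return (u, [-x for x in v], [-x for x in w])


def rotation_trace(src: Frame, dst: Frame) -> float:
    """Trace of the unique rotation carrying frame src to frame dst."""
    return sum(dot(s, d) for s, d in zip(src, dst))


def twisted_by_metric(f_prev: Frame, f_next: Frame) -> bool:
    # d(I, A) is the rotation angle of A, which decreases as trace(A) grows,
    # so d(I, E) < d(I, D) is the same as trace(E) > trace(D).
    return rotation_trace(f_prev, flipped(f_next)) > rotation_trace(f_prev, f_next)


def twisted_by_dot_products(f_prev: Frame, f_next: Frame) -> bool:
    # Dot-product criterion: twisted iff v.v' + w.w' < 0.
    return dot(f_prev[1], f_next[1]) + dot(f_prev[2], f_next[2]) < 0


standard = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])
half_turn = flipped(standard)  # the standard frame rotated by pi about u
print(twisted_by_dot_products(standard, standard))   # False
print(twisted_by_dot_products(standard, half_turn))  # True
```

Since trace(E) - trace(D) equals -2 times the dot-product sum, the two tests agree whenever the sum is non-zero.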

Remark 3.11. The backbone graph connection on the graph underlying T(P) clearly has 
trivial holonomy since T(P) is contractible. It extends naturally to an SO(3) graph 
connection on the graph underlying G(P), where to the oriented edge corresponding to the
hydrogen bond connecting $N_i - H_i$ and $O_j$, we assign the unique element of SO(3), whose
existence is guaranteed by Proposition 3.2, which maps $\mathcal{F}_{i-1}$ to $\mathcal{F}_j$, for $i = 2, \ldots, L-2$.

This graph connection on G(P) also has trivial holonomy by construction. Our fatgraph 
model G(P) arises from a discretization of this SO(3) graph connection giving a Z/2 graph 
connection, where the oriented edges with non-trivial holonomy are the twisted ones, and 
this Z/2 graph connection on the graph underlying G(P) typically does not have trivial 
holonomy. 

3.4. The basic model and its extensions. The previous section completed the defini- 
tion of our basic fatgraph model G(P) of a polypeptide structure P. Notice that hydrogen 
bonds and alpha carbon linkages are treated in precisely the same manner in this con- 
struction. 

A crucial point in practice is that the polypeptide structure itself depends upon data 
which must be considered as idealized for various reasons: proteins actually occur in sev- 
eral closely related conformations, varying under thermal fluctuations for example, whose 
sampling is corrupted by experimental uncertainties as well as errors. The fatgraph G(P) 
must therefore not be taken as defined absolutely, but rather as defined only in some 
statistical sense as a family of fatgraphs $\{G(P) : P \in \mathcal{P}\}$ based on a collection $\mathcal{P}$ of
polypeptide structures which differ from one another by a small number of such idealizations,
uncertainties, or errors. Properties of the fatgraph G(P) that we can meaningfully
assign to the polypeptide structure P must be nearly constant on $\mathcal{P}$, leading to the notion
of "robustness" of invariants of G(P) as descriptors of P, which is discussed in Section 4.
Nevertheless, the construction of our model has been given based on the inputs above
regarded as exact and error-free.

In particular, there is the tacit assumption that there is never equality in the determination
of whether to twist in Construction 3.4. In practice, $\vec{v}_i \cdot \vec{v}_{i+1} + \vec{w}_i \cdot \vec{w}_{i+1} = 0$ never
occurs exactly, but there is the real possibility that this condition nearly holds, that is,
we cannot reliably determine whether to twist if $|\vec{v}_i \cdot \vec{v}_{i+1} + \vec{w}_i \cdot \vec{w}_{i+1}|$ is below some small
threshold because of experimental uncertainty, cf. Remark 3.7. There are similar issues
in the specification of which hydrogen bonds exist in input ii) based upon the possibly
problematic exact atomic locations from which the electrostatic potentials are inferred, as
well as whether to twist in Construction 3.9.
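One way to make the near-equality issue explicit in an implementation is to return a third, indeterminate outcome when the dot-product sum is small in absolute value. The sketch below is illustrative only: the function name and the default tolerance are assumptions and do not come from the paper, which leaves the threshold unspecified.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def twist_decision(v_a: Sequence[float], w_a: Sequence[float],
                   v_b: Sequence[float], w_b: Sequence[float],
                   tolerance: float = 0.05) -> str:
    """Classify an edge between two frames from their v and w vectors.

    The tolerance is a hypothetical parameter standing in for the
    'small threshold' below which the sign cannot be trusted.
    """
    score = dot(v_a, v_b) + dot(w_a, w_b)
    if abs(score) < tolerance:
        return "indeterminate"
    return "twisted" if score < 0 else "untwisted"
```

Edges reported as indeterminate are natural candidates for the mutations considered in Section 4, since their twisting may change under small perturbations of the input.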

However, there is the following control over the topological type of F(G(P)), which 
will be the basis for several of the robust invariants of fatgraphs and resulting meaningful 
descriptors of polypeptides studied in Section 4.

Corollary 3.12. Let P, P' be polypeptide structures with the same inputs i) but differing in
inputs ii-iii) in the determinations of the existence of m hydrogen bonds and of the twisting
of n alpha carbon linkages or hydrogen bonds. Then $|r(G(P)) - r(G(P'))| \le m + n$.

Proof. This is an immediate consequence of Proposition 2.2. □

There are several generalizations of the basic fatgraph model G(P) of a polypeptide
structure. As already mentioned, we might specify energy thresholds $E_- < E_+ < 0$ and
demand that the potential energy of a hydrogen bond lie in the range between $E_-$ and $E_+$
in order that it be regarded as a hydrogen bond to include in input ii) so as to produce
a fatgraph denoted $G_{E_-,E_+}(P)$. We shall describe in Section 5 certain experiments with
proteins using various such energy thresholds.

One may also model bifurcated hydrogen bonds and allow hydrogen or oxygen atoms in 
the peptide units to participate in at most $\beta > 1$ hydrogen bonds by altering the fatgraph
building block in Figure 3.1 by replacing the univalent vertices representing hydrogen and
oxygen atoms by vertices of valence $\beta + 1$. Different valencies less than $\beta + 1$ for oxygen and
hydrogen can be implemented with this single building block by appropriately imposing
different constraints in input ii). Natural fattenings on these new vertices representing
hydrogen or oxygen atoms are determined as follows: project centers of partners in bonding
into the plane of the peptide unit with origin at the center of the corresponding nitrogen or
carbon atom, respectively, where the positive x-axis contains the bond axis of the incident
peptide bond, and take these projections in the ordering of increasing argument. 
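The ordering by increasing argument can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name is hypothetical, the y-axis is assumed to be the cross product of the plane normal with the x-axis, and arguments are assumed to be measured in the range from 0 up to (but not including) 2π.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def sub(a: Vec, b: Vec) -> List[float]:
    return [x - y for x, y in zip(a, b)]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross(a: Vec, b: Vec) -> List[float]:
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def order_partners(origin: Vec, x_axis: Vec, normal: Vec,
                   partners: List[Vec]) -> List[Vec]:
    """Sort partner positions by the argument of their planar projection."""
    y_axis = cross(normal, x_axis)  # assumed orientation of the plane

    def argument(p: Vec) -> float:
        d = sub(p, origin)
        # Taking dot products with the in-plane axes discards the normal
        # component, which is exactly the orthogonal projection.
        return math.atan2(dot(d, y_axis), dot(d, x_axis)) % (2 * math.pi)

    return sorted(partners, key=argument)
```

The resulting cyclic order would then serve as the fattening at the vertex of valence β + 1.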

Our definition of polypeptide structure assumes that there are no atoms missing along 
the backbone, and this is actually somewhat problematic in practice. A useful aspect of 
the methods in Section 3.2 is that such gaps present no essential difficulty since an edge
connecting fatgraph building blocks can just as well be taken to represent a gap between
peptide units as to represent an alpha carbon linkage as in our model articulated before.
The determination of twisting on these new gap edges is just as in Construction 3.4, but
now the 3-frames in this construction do not correspond to consecutive peptide units. 

A more profound extension of the method is to use the bi-invariant metric on SO(3) to 
give finer discretizations of the SO(3) graph connection on G(P) discussed in Remark 3.11.
For example, rather than our Z/2 graph connection modeled by fatgraphs, one can easily
implement the analogous construction of a Z/n graph connection based on the natural
extensions of Constructions 3.4 and 3.9 modeled by graphs with fattenings and Z/n-colorings.
These "rotamer fatgraphs" capture the "protein rotamers" which are highly
studied in the biophysics literature. 

A still more profound innovation rests on the observation that our techniques are of 
greater utility and can be adapted to model essentially any molecule since 3-frames can 
analogously be associated to any bond axis. One might thus model entire amino acids 
themselves as rotamer fatgraphs to give a truly realistic model of a polypeptide. 

Furthermore, the discussion thus far has concentrated on molecules at equilibrium, and 
one might instead regard the fatgraph or rotamer fatgraph as a dynamic model by taking 
time or temperature dependent inputs i-iii). 



4. Robust polypeptide descriptors 

We have described in the previous sections the fatgraph G(P) of a polypeptide structure
P with simple hydrogen bonding determined by inputs i-iii) based upon specified
energy thresholds. With the understanding that the input data can be problematic due to
errors and experimental indeterminacies, we must consider the fatgraph as defined only in
a statistical sense, where a family of fatgraphs arises from a collection $\mathcal{P} \ni P$ of polypeptide
structures which differ from P by a small number of such errors or indeterminacies.
As such, only certain properties of the fatgraph G(P) can meaningfully be assigned as
descriptors of P, namely, those properties which do not vary significantly over the various
polypeptide structures in $\mathcal{P}$. In this section, we shall first formalize this notion of meaningful
properties of fatgraphs, and then describe and discuss a myriad of such polypeptide
descriptors.

Let $\mathcal{G}$ denote the collection of all strong equivalence classes of fatgraphs G(P) arising
from non-empty polypeptide structures P. We may perform the following modifications
to any $G \in \mathcal{G}$ leaving all other data unchanged:

Mutation i) change the color of one alpha carbon linkage of G;

Mutation ii) change the color of one edge of G corresponding to a hydrogen bond;

Mutation iii) add or delete an untwisted edge of G corresponding to a hydrogen bond;

Mutation iv) replace a fatgraph building block of G by two building blocks connected
by an untwisted alpha carbon linkage, where any edges corresponding to hydrogen bonds
incident on the original building block are connected to the replacement building block
that occurs first along the backbone from N to C termini, and the reverse of this operation.

Suppose that X is some set with metric $\rho$. We say that a function $\nu : \mathcal{G} \to X$ is $\kappa$-robust
of radius $Q$ on $\mathcal{H} \subseteq \mathcal{G}$, where $\kappa \ge 0$ is real and $Q \ge 0$ is an integer, if $\rho(\nu(G), \nu(G')) \le q\kappa$
whenever $G'$ arises from $G \in \mathcal{H}$ by a sequence

$G = G_0 - G_1 - \cdots - G_q = G'$, with $q \le Q$,

where $G_{j+1}$ arises from $G_j$ by a single mutation of type i-iv), for $j = 0, \ldots, q-1$. If $\nu$ is
$\kappa$-robust of infinite radius on all of $\mathcal{G}$, then we say simply that $\nu$ is $\kappa$-robust.

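By definition, if X supports operations of addition and scalar multiplication and if $\nu$ is
$\kappa$-robust of radius $Q$ on $\mathcal{H}$, then for any $\alpha \in \mathbb{R}$, $\alpha\nu$ is $|\alpha|\kappa$-robust of radius $Q$ on $\mathcal{H}$, and
furthermore, if $\nu'$ is $\kappa'$-robust of radius $Q'$ on $\mathcal{H}'$, then $\nu \pm \nu'$ is $(\kappa + \kappa')$-robust of radius
$\min(Q, Q')$ on $\mathcal{H} \cap \mathcal{H}'$.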
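The additivity claim is a direct application of the triangle inequality. As a brief worked step, write $\nu, \nu'$ for the two functions and $\kappa, \kappa'$ for their robustness constants, and assume (this is not stated explicitly in the text) that the metric $\rho$ on X is induced by a norm $\|\cdot\|$. For a sequence of $q \le \min(Q, Q')$ mutations from $G$ in the common domain to $G'$:

```latex
\begin{align*}
\rho\bigl((\nu \pm \nu')(G),\, (\nu \pm \nu')(G')\bigr)
  &= \bigl\| \bigl(\nu(G) - \nu(G')\bigr) \pm \bigl(\nu'(G) - \nu'(G')\bigr) \bigr\| \\
  &\le \bigl\| \nu(G) - \nu(G') \bigr\| + \bigl\| \nu'(G) - \nu'(G') \bigr\| \\
  &\le q\kappa + q\kappa' = q(\kappa + \kappa').
\end{align*}
```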

It is only the $\kappa$-robust functions $\nu$ of reasonably large radius $Q$ and sufficiently small
value of $\kappa$ on $\mathcal{H} \subseteq \mathcal{G}$ which are significant characteristics of polypeptide structures whose
fatgraphs G lie in $\mathcal{H}$. This is because a combination of mutations arising from $q \le Q$
errors or indeterminacies of the input data then affects the value of $\nu(G)$ by an amount
bounded by $q\kappa$, which must be small compared to the value of $\nu(G)$.

It is clear that any two fatgraphs arising from a non-empty polypeptide structure are 
related by a finite sequence of mutations i-iv). By assigning a penalty of some non-zero 
magnitude to each type of mutation, the mutation distance between two such fatgraphs 
can be defined as the minimum sum of penalties corresponding to sequences of mutations 
relating them. This gives a metric, albeit seemingly difficult to compute, on $\mathcal{G}$ itself,
and we may regard two polypeptide structures as being similar if the mutation distance
between their corresponding fatgraphs is small. The assignment of fatgraph G(P) to
polypeptide structure P is $\kappa$-robust by definition with this metric, where the parameter
$\kappa$ is the largest penalty.

For several obvious numerical examples, the numbers L(G) of residues and B(G) of 
hydrogen bonds of G are 1-robust, and the Euler characteristic $\chi(G)$ of G or F(G) is
likewise 1-robust since $\chi(G) = 1 - B(G)$. The numbers $v(G) = 2L(G) - 2$ of vertices
and $e(G) = B(G) + 2L(G) - 3$ of edges of G are therefore 2- and 3-robust respectively.
The number of twisted edges corresponding to hydrogen bonds and the number of twisted 
alpha carbon linkages of G are each also clearly 1-robust. 
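The counts above are simple enough to verify mechanically. The following sketch, with a hypothetical function name, evaluates the stated formulas for the numbers of vertices and edges and confirms that their difference reproduces the Euler characteristic 1 - B(G).

```python
from typing import Dict


def basic_counts(num_residues: int, num_hbonds: int) -> Dict[str, int]:
    """Vertex, edge and Euler characteristic counts from L(G) and B(G)."""
    vertices = 2 * num_residues - 2            # v(G) = 2L(G) - 2
    edges = num_hbonds + 2 * num_residues - 3  # e(G) = B(G) + 2L(G) - 3
    return {"vertices": vertices, "edges": edges, "euler": vertices - edges}


counts = basic_counts(num_residues=10, num_hbonds=4)
print(counts)  # {'vertices': 18, 'edges': 21, 'euler': -3}
```

Adding one hydrogen bond changes the edge count by one, while adding one residue changes the vertex and edge counts by two each, in line with the robustness constants quoted in the text.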

With X the set of all words of finite length in the alphabet {F, N} given the edit 
distance with unit operation cost [12], the flip sequence of G is 1-robust by definition. 
In contrast, the plus/minus sequence of the alternative model K(P) in Appendix [X] as a
word in the alphabet $\{+, -\}$ with the same metric is not $\kappa$-robust of radius greater than
zero on $\mathcal{G}$ for any $\kappa$, since a single modification of type i) to G can change all the entries
of the plus/minus sequence.

For another negative example with $X = \mathbb{R}$, the genus g(G) of F(G) is not $\kappa$-robust of
any radius greater than zero for any $\kappa$ on $\mathcal{G}$, since a single modification of type ii) on an
untwisted G can produce a fatgraph G' with F(G') non-orientable, and $|g(G) - g(G')| = [1 + B(G) - r(G)]/2$.
In contrast, the modified genus is robust of infinite radius according
to the following result.

Proposition 4.1. The number r(G) of boundary components and the modified genus
$g^*(G)$ of F(G) are 1-robust. Moreover, the number of appearances in the flip sequence of
G of any fixed word of length k in the alphabet {0, 1} is k-robust.

Proof. The function r satisfies the required properties by Corollary 3.12, hence so too
does $g^* = (1 + B - r)/2$. The remaining assertion follows essentially by definition. □
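The last assertion can be illustrated for the simplest kind of change, a single letter substitution in the flip sequence (as produced by a mutation of type i): a word of length k overlaps a given position in at most k windows, so its count changes by at most k. The sketch below uses a hypothetical helper name and writes the flip sequence as a string over the letters F and N; insertions and deletions are not treated here.

```python
def count_occurrences(sequence: str, word: str) -> int:
    """Number of (possibly overlapping) occurrences of word in sequence."""
    k = len(word)
    return sum(1 for i in range(len(sequence) - k + 1)
               if sequence[i:i + k] == word)


flips = "NNFNNFFN"
print(count_occurrences(flips, "NF"))  # 2

# Substituting one letter changes the count by at most len(word).
changed = flips[:2] + "N" + flips[3:]      # "NNNNNFFN"
print(count_occurrences(changed, "NF"))    # 1
```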

Given a closed edge-path $\gamma$ on $G \in \mathcal{G}$, define the peptide-length of $\gamma$ to be the number
of pairs of distinct peptide units visited by $\gamma$ and define the edge-length of $\gamma$ to be the
number of edges of G traversed by $\gamma$, each counted with multiplicity. For example, the
dotted boundary components in Figure A.3 that are characteristic of alpha helices and
beta strands all have peptide-length 4 and various edge-lengths 4, 6, 8. Define the peptide-length
spectrum $\mathcal{P}(G)$ and the edge-length spectrum $\mathcal{E}(G)$ of $G \in \mathcal{G}$, respectively, to be
the unordered set of peptide-lengths and edge-lengths of boundary components of F(G).
Let $P(G)$ and $E(G)$ denote their respective means. It is worth pointing out that the
preponderance of alpha helices and beta strands in practice heavily biases $P(G)$ towards
4.

Let X denote the collection of all finite unordered collections of natural numbers. The
elements of a member of X may be ordered by increasing magnitude. The distance between
two such ordered finite collections of natural numbers may then be defined by standard
methods [12], and this induces a metric on X itself. We may thus regard $\mathcal{P}$ and $\mathcal{E}$ as
functions on $\mathcal{G}$ with values in the metric space X. As in the proof of Corollary 3.12, these
functions are $\kappa$-robust, where the parameter $\kappa$ depends on the choice of metric.

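Lemma 4.2. Suppose that $\mu : \mathcal{G} \to \mathbb{Z}$ is $k$-robust of radius at least $Q$ on $\mathcal{G}$ and that
$\nu : \mathcal{G} \to \mathbb{R}$ is $\kappa$-robust of radius $Q$ on

$\mathcal{H} = \{G \in \mathcal{G} : \mu(G) > kQ \text{ and } \nu(G) + Q\kappa \le [\mu(G) - kQ]^2\}$.

Then $\nu(G)/\mu(G) : \mathcal{G} \to \mathbb{R}$ is $(k + \kappa)$-robust of radius $Q$ on $\mathcal{H}$.

Proof. Suppose that $G \in \mathcal{H}$ and that $G = G_0 - G_1 - \cdots - G_q = G'$ is a sequence as before,
with $q \le Q$. First note that

$\nu(G_i) \le \nu(G_0) + i\kappa$ and $\mu(G_i) \ge \mu(G_0) - ki$,

by hypothesis, and so

$\frac{\nu(G_i)}{[\mu(G_i)]^2} \le \frac{\nu(G_0) + i\kappa}{[\mu(G_0) - ki]^2} \le \frac{\nu(G_0) + Q\kappa}{[\mu(G_0) - kQ]^2} \le 1$

since $G_0 \in \mathcal{H}$, for $i = 0, \ldots, q$. Furthermore, we have that $|\nu(G_i) - \nu(G_{i+1})| \le \kappa$ and
$|\mu(G_i) - \mu(G_{i+1})| \le k$, for each $i = 0, \ldots, q-1$, and hence

$\left| \frac{\nu(G_i)}{\mu(G_i)} - \frac{\nu(G_{i+1})}{\mu(G_{i+1})} \right| = \frac{|\mu(G_{i+1})\nu(G_i) - \mu(G_i)\nu(G_{i+1})|}{\mu(G_i)\,\mu(G_{i+1})} \le \begin{cases} \dfrac{\kappa}{|\mu(G_i)|} + k\,\dfrac{\nu(G_{i+1})}{\mu(G_i)\,\mu(G_{i+1})}, & \text{if } \mu(G_{i+1}) \le \mu(G_i), \\[2ex] \dfrac{\kappa}{|\mu(G_{i+1})|} + k\,\dfrac{\nu(G_i)}{\mu(G_i)\,\mu(G_{i+1})}, & \text{if } \mu(G_{i+1}) > \mu(G_i), \end{cases} \le \kappa + k.$

The triangle inequality then gives

$\left| \frac{\nu(G)}{\mu(G)} - \frac{\nu(G')}{\mu(G')} \right| \le q(\kappa + k)$,

as required.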



□

Proposition 4.3. The mean $P(G)$ of the peptide-length spectrum is 3-robust of radius $Q$
on

$\{G \in \mathcal{G} : r(G) > Q \text{ and } L(G) + Q - 1 \le \tfrac{1}{2}[r(G) - Q]^2\}$,

and the mean $E(G)$ of the edge-length spectrum is 7-robust of radius $Q$ on

$\{G \in \mathcal{G} : r(G) > Q \text{ and } B(G) + 2L(G) - 3 + 6Q \le [r(G) - Q]^2\}$.

Proof. Since each peptide unit occurs exactly twice in the union of all the boundary
components, the sum of all the elements in $\mathcal{P}(G)$ is constant equal to $2[L(G) - 1]$, which
is 2-robust according to earlier comments. Since $P(G) = 2[L(G) - 1]/r(G)$ and $r(G)$
is 1-robust by Proposition 4.1, the first assertion follows from Lemma 4.2. Similarly, each
edge occurs exactly twice in the union of all boundary components, so the sum of all the
elements in $\mathcal{E}(G)$ is equal to $2e(G) = 2[B(G) + 2L(G) - 3]$, which is 6-robust according to
earlier comments. The second assertion therefore likewise follows from Lemma 4.2. □
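The two means in the proof are ratios of simple counts, so they are easy to evaluate once L(G), B(G) and r(G) are known. The sketch below uses hypothetical function names; the domain test encodes the hypothesis of Lemma 4.2 for a general ratio ν/μ, with a non-strict inequality assumed.

```python
def mean_peptide_length(num_residues: int, num_boundaries: int) -> float:
    """P(G) = 2[L(G) - 1] / r(G)."""
    return 2 * (num_residues - 1) / num_boundaries


def mean_edge_length(num_residues: int, num_hbonds: int,
                     num_boundaries: int) -> float:
    """E(G) = 2[B(G) + 2L(G) - 3] / r(G)."""
    return 2 * (num_hbonds + 2 * num_residues - 3) / num_boundaries


def in_ratio_domain(mu: float, nu: float, k: float, kappa: float, q: int) -> bool:
    """Hypothesis of Lemma 4.2: mu > kQ and nu + Q*kappa <= (mu - kQ)^2."""
    return mu > k * q and nu + q * kappa <= (mu - k * q) ** 2


print(mean_peptide_length(11, 4))  # 5.0
print(mean_edge_length(11, 5, 4))  # 12.0
```

For the peptide-length mean one would take μ = r(G) with k = 1 and ν = 2[L(G) - 1] with κ = 2, as in the proof above.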

Other notions of lengths of closed edge-paths in G may also be useful. For example, for 
each amino acid type, each boundary component of F(G) visits a certain number of alpha 
carbon linkages labeled by amino acids of this type, and alternative notions of length arise 
by assigning weights to the various amino acids and taking the weighted sum over amino 
acids visited. The robustness of these sorts of invariants seems difficult to analyze. 

It is also worth pointing out that the underlying graph of the fatgraph G(P) has its 
own related characteristics for any polypeptide structure P. For example, there is an 
associated notion of length spectrum, namely, one or another of the notions of generalized 
length discussed before of the closed edge-paths or simple closed edge-paths on the graph. 
Invariants of this type, which can be derived from the graph underlying the fatgraph, may 
also be of importance in practice, and their robustness is based on the invariance of the 
underlying graph under the modifications i-ii). 

The fatgraph G is of a special type in that it has a "spine" arising from the backbone, 
namely, the long horizontal segment arising from the concatenation of horizontal segments 
in the fatgraph building blocks which was discussed in Section 3.2. This "spined fatgraph"
admits a canonical "reduction" by serially removing each edge incident on a univalent 
vertex and amalgamating the pair of edges incident on the resulting bivalent vertex into 
a single edge. The graph underlying this reduced spined fatgraph is a "chord diagram", 
and there are countless "finite-type invariants associated with weight systems" [23], which
could provide useful protein invariants whose robustness depends upon the choice of weight 
system. See Section 6 for a further discussion of related quantum invariants.

5. First results 

5.1. Aspects of implementation. In this section, we shall first make several practical 
remarks about the implementation in this paper of our methods for a protein from its 
PDB and DSSP files, cf. Section 1, where we shall consider here only the model with
simple hydrogen bonds, i.e., $\beta = 1$, which depends upon energy thresholds $E_- < E_+ < 0$
as follows. In effect, we employ the standard methods of DSSP described in Section
1 to estimate electrostatic potentials of possible hydrogen bonds, and we tabulate to
hundredths of kcal/mole the two strongest such potentials in which each hydrogen or
oxygen atom in a polypeptide unit participates. Any such energies beyond our energy
thresholds are then discarded. Displacements of corresponding backbone atoms are used
to discriminate between equal tabulated electrostatic potentials in order to derive a strict
linear ordering on them: a hydrogen bond with energy $E$ between atoms at distance $\delta$
precedes a hydrogen bond with energy $E'$ between atoms at distance $\delta'$ if $E < E'$ or if
$E = E'$ and $\delta < \delta'$, where $E = E'$ to hundredths of kcal/mole and $\delta = \delta'$ to thousandths of
Angstroms never occurs in practice. We finally greedily add to B in input ii) the hydrogen
bonds in this linear ordering provided they do not violate the a priori simple hydrogen
bond assumption $\beta = 1$.
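The selection procedure just described can be sketched as a filter, a sort and a greedy pass. This is an illustrative sketch under stated assumptions, not the authors' code: the `Bond` record and `select_bonds` are hypothetical names, the energy window is taken to be a closed interval, and the preliminary tabulation of the two strongest potentials per atom is omitted.

```python
from typing import Dict, List, NamedTuple


class Bond(NamedTuple):
    donor: int       # index i of the hydrogen atom H_i
    acceptor: int    # index j of the oxygen atom O_j
    energy: float    # electrostatic potential in kcal/mole (negative)
    distance: float  # separation in Angstroms, used only to break ties


def select_bonds(candidates: List[Bond], e_minus: float, e_plus: float,
                 beta: int = 1) -> List[Bond]:
    # Discard energies outside the thresholds (closed interval assumed).
    kept = [b for b in candidates if e_minus <= b.energy <= e_plus]
    # Lower (stronger) energy first; ties broken by shorter distance.
    kept.sort(key=lambda b: (round(b.energy, 2), round(b.distance, 3)))
    used_h: Dict[int, int] = {}
    used_o: Dict[int, int] = {}
    chosen: List[Bond] = []
    for b in kept:
        # Accept a bond only if neither atom already has beta bonds.
        if used_h.get(b.donor, 0) < beta and used_o.get(b.acceptor, 0) < beta:
            chosen.append(b)
            used_h[b.donor] = used_h.get(b.donor, 0) + 1
            used_o[b.acceptor] = used_o.get(b.acceptor, 0) + 1
    return chosen
```

Passing `float("-inf")` for `e_minus` corresponds to imposing only an upper energy threshold.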

Minor technical comments are that unspecified or missing residue types are assumed not 
to be Proline for input i), atomic locations in the PDB with highest occupancy numbers 
are those used for determining input iii), and we take only the first model in case there 
are several models in a PDB file. 

Whenever there is a missing datum, for example the atomic location of a backbone atom 
in a PDB file, that is required for the algorithmic construction of the 3-frame corresponding
to its peptide unit, we concatenate an associated fatgraph building block without twisting 
the alpha carbon linkage, and we prohibit any hydrogen bonding to its constituent edges. 
Such "gap frames" are included for each problematic peptide unit. A number of such gap 
frames may occur between two fatgraph building blocks that can consistently be assigned 
3-frames, and the last alpha carbon linkage connecting a gap frame to a non-gap frame is 
twisted or untwisted based upon the usual criteria for the two adjacent well-defined non- 
gap frames. In particular, the fatgraph constructed is always connected. Other examples 
of gap frames arise from breaks along the backbone as detected by a separation of more 
than 2.0 Angstroms between atoms $C_i$ and $N_{i+1}$, for any i.
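Detecting breaks along the backbone amounts to a distance test between consecutive residues. A minimal sketch follows, assuming that the carbonyl carbon and amide nitrogen coordinates are available as parallel lists indexed by residue; the function name and data layout are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def backbone_breaks(carbons: List[Point], nitrogens: List[Point],
                    max_separation: float = 2.0) -> List[int]:
    """Indices i for which the carbon of residue i and the nitrogen of
    residue i + 1 are more than max_separation Angstroms apart."""
    breaks = []
    for i in range(len(carbons) - 1):
        if math.dist(carbons[i], nitrogens[i + 1]) > max_separation:
            breaks.append(i)
    return breaks


carbons = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
nitrogens = [(-1.0, 0.0, 0.0), (1.3, 0.0, 0.0), (7.0, 0.0, 0.0)]
print(backbone_breaks(carbons, nitrogens))  # [1]
```

Each reported index would then give rise to a gap frame in the sense described above.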

5.2. Injectivity results. The database CATH version 3.2.0 [53] is a collection $\mathcal{P}_{\mathrm{CATH}}$
of 114,215 protein domains, which are uniquely catalogued by a nine-tuple of natural
numbers; this is a hierarchical classification with a "standard" representative domain
chosen in each class. Our methods have been applied to the associated PDB and DSSP
files so as to produce corresponding connected fatgraphs $G_{-\infty,E}(P)$ for each $P \in \mathcal{P}_{\mathrm{CATH}}$
and various energy thresholds $E < 0$. We have concentrated here just on the question of
finding tuples of robust invariants that uniquely determine the domain P among all the
domains in $\mathcal{P}_{\mathrm{CATH}}$, or the standard representatives of all the classes at some level, and
this section simply presents these empirical "injectivity" results. 
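An empirical injectivity test of this kind reduces to grouping domains by their tuple of invariants and reporting any group with more than one member. The sketch below shows the idea on made-up data; the function name and the toy domain labels are hypothetical and are not taken from CATH.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_collisions(invariants: Dict[str, Tuple]) -> List[List[str]]:
    """Return the groups of domains that share the same invariant tuple."""
    groups = defaultdict(list)
    for domain, values in invariants.items():
        groups[tuple(values)].append(domain)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


toy = {"domain_a": (1.0, 21), "domain_b": (0.5, 32), "domain_c": (1.0, 21)}
print(find_collisions(toy))  # [['domain_a', 'domain_c']]
```

An empty result would mean that the chosen invariants separate all domains in the collection.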

Our first results rely only on the most basic of robust invariants which depend only 
on the topological type of the surface, namely, the modified genus $g^*_E(P)$ and the number
$r_E(P)$ of boundary components of $F(G_{-\infty,E}(P))$.

Result 5.1. The 14 numbers $(g^*_E(P), r_E(P))$, with $E = -0.5(1 + t)$, for integral $0 \le t \le 6$,
uniquely determine the primary structure of each $P \in \mathcal{P}_{\mathrm{CATH}}$ except for the special
cases given in Table 5.1. In particular, these 14 numbers uniquely determine the depth 7
classes (CATHSOL) except for the four following special cases: 3.40.50.720.63.1.1.1.1 and
3.40.50.720.63.1.2.1.1; 3.30.70.270.7.1.2.1.1 and 3.30.70.270.2.1.5.5.2; 2.10.210.10.1.1.1.1.1
and 1.10.8.10.13.1.1.1.2; 2.10.69.10.3.2.2.1.1 and 2.10.69.10.3.2.5.1.1.
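The count of 14 comes from seven energy thresholds with two invariants each. The following short sketch, with hypothetical function names, lists the thresholds and flattens a sequence of (modified genus, boundary count) pairs into one tuple; it assumes that the tuples in Table 5.1 are ordered by increasing t, which the text does not state explicitly.

```python
from typing import List, Sequence, Tuple


def energy_thresholds(count: int = 7) -> List[float]:
    """E = -0.5 (1 + t) for t = 0, ..., count - 1."""
    return [-0.5 * (1 + t) for t in range(count)]


def invariant_vector(pairs: Sequence[Tuple[float, int]]) -> Tuple:
    """Flatten [(g*, r), ...] into a single tuple of invariants."""
    return tuple(x for pair in pairs for x in pair)


print(energy_thresholds())  # [-0.5, -1.0, -1.5, -2.0, -2.5, -3.0, -3.5]

# One row of Table 5.1, regrouped into (g*, r) pairs.
row = [(0.0, 11), (0.0, 6), (0.0, 4), (0.0, 2), (0.0, 2), (0.0, 1), (0.0, 1)]
print(len(invariant_vector(row)))  # 14
```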

The next injectivity result relies upon several robust invariants of the fatgraph. 

Result 5.2. For any polypeptide structure P and energy threshold $E < 0$, consider the
10 numbers given by: the number of residues of P, the number of hydrogen bonds of P
with energy at most E, $r_E(P)$, $g^*_E(P)$, the mean of the peptide length spectrum to one
significant digit, the number of twisted alpha carbon linkages of $G_{-\infty,E}(P)$, the number
of twisted edges of $G_{-\infty,E}(P)$ corresponding to hydrogen bonds, the respective number of
pairs FF, FN, and NN occurring in the flip sequence. These numbers for the single energy
level $E = -0.5$ uniquely determine the standard representatives of $\mathcal{P}_{\mathrm{CATH}}$ classes at depth
four (CATH) except for the 19 exceptions enumerated in Table 5.2.

Our final injectivity result relies only on the model of the backbone, namely, on the 
flip sequence. 

Result 5.3. The flip sequence nearly uniquely determines elements of $\mathcal{P}_{\mathrm{CATH}}$ with the 45
exceptions enumerated in Table 5.3.

Table 5.1. Exceptions to injectivity in Result 5.1

Invariants

(26.5,80,23.5,66,21.5,58,16.5,44,11.5,18,5.0,4,3.0,2)

(36.5,81,32.5,72,31.5,66,29.0,56,23.5,34,14.0,12,2.0,2) 
(34.5,84,32.5,71,31.5,69,29.5,56,22.5,41,14.0,20,5.0,9) 
(20.5,89,17.5,82,14.0,66,8.5,48,6.5,25,3.0,12,1.0,3) 
(41.0,99,30.5,76,25.5,51,14.0,31,8.0,19,5.5,9,0.5,3) 
(20.5,89,17.5,82,14.0,67,8.5,48,6.5,25,3.0,12,1.0,3) 

(8.0,71,6.0,63,5.5,55,5.0,43,3.0,17,0.0,4,0.0,1) 
(19.5,68,16.5,54,12.5,48,12.5,28,7.5,18,1.0,11,1.0,4) 
(4.0,96,4.0,91,2.5,86,1.0,64,0.0,18,0.0,1,0.0,1) 
(4.0,102,3.0,93,2.5,84,1.0,58,0.0,22,0.0,2,0.0,1) 
(7.5,38,7.0,33,5.0,32,2.5,20,1.5,10,1.0,5,0.5,4) 
(1.0,29,0.5,29,0.5,27,0.0,20,0.0,11,0.0,5,0.0,2) 
(4.5,169,4.0,157,2.5,145,2.0,113,1.0,59,0.5,10,0.5,1) 
(34.0,83,32.0,76,31.5,71,29.0,55,20.0,41,16.0,26,5.0,7) 
(36.0,136,34.0,123,32.0,114,23.5,85,8.0,40,2.0,15,0.0,5) 
(0.0,11,0.0,6,0.0,4,0.0,2,0.0,2,0.0,1,0.0,1) 
(0.5,32,0.0,30,0.0,29,0.0,20,0.0,8,0.0,2,0.0,1) 
(0.5,97,0.5,94,0.5,85,0.5,65,0.5,25,0.5,5,0.0,1) 
(1.0,21,1.0,17,0.5,15,0.5,14,0.5,9,0.0,3,0.0,2) 
(1.5,42,1.5,42,1.5,39,0.5,32,0.0,16,0.0,5,0.0,1) 
(1.5,43,1.5,42,1.5,38,0.5,32,0.0,17,0.0,5,0.0,1) 
(3.0,21,3.0,18,3.0,15,3.0,13,1.5,9,0.5,3,0.5,1) 
(3.0,21,3.0,18,3.0,16,3.0,13,2.0,8,0.5,3,0.0,1) 

(4.5,8,4.5,8,3.5,7,2.5,5,1.0,4,0.0,2,0.0,1) 
(4.5,35,3.5,34,3.0,29,1.5,23,1.0,14,0.0,6,0.0,1) 
(4.5,51,4.5,42,4.5,32,4.0,20,3.5,15,2.5,5,1.0,4) 
(5.5,48,4.5,40,4.0,32,3.0,18,1.0,12,0.0,10,0.0,4) 
(6.0,42,5.5,36,5.5,31,5.0,29,4.5,15,2.0,1,0.0,1) 
(6.0,44,5.5,39,4.5,30,4.5,23,3.5,14,1.0,6,1.0,2) 



(6.5,32,6.0,30,5.5,27,3.5,28,2.5,16,1.5,4,0.0,1)
(6.5,44,4.5,41,4.5,35,4.0,25,3.5,13,2.5,7,0.5,4)
(6.5,57,6.0,52,6.0,52,5.5,42,3.5,25,2.5,7,0.5,1)
(7.0,65,7.0,64,6.5,60,3.0,54,2.0,28,0.5,5,0.0,1)
(7.5,71,6.5,63,4.5,57,4.5,41,2.0,19,2.0,6,0.0,3)
(7.5,72,5.5,64,5.0,56,5.0,43,3.0,17,0.0,4,0.0,1)
(8.0,65,8.0,57,7.5,50,6.0,35,3.5,24,1.0,8,0.0,1)
(8.5,35,8.0,33,7.5,31,6.0,26,4.0,17,3.5,4,0.5,2)
(8.5,69,7.5,62,6.5,56,5.5,45,5.5,24,3.5,3,0.0,1)
(8.5,70,7.5,62,7.0,56,6.0,40,4.5,20,2.5,2,0.0,1)
(9.0,68,8.0,60,6.5,53,6.0,40,5.0,12,1.5,1,0.5,1)
(9.0,69,7.5,63,6.5,54,5.5,43,4.5,14,1.0,2,0.0,1)
(9.0,70,7.5,63,6.5,55,6.0,43,5.0,19,2.0,1,0.0,1)
(9.5,67,7.5,60,5.5,52,5.0,41,4.0,12,2.0,3,0.0,1)
(9.5,67,8.0,61,6.0,54,5.0,43,5.0,19,2.0,3,0.0,1)
(9.5,68,8.0,62,7.5,52,6.0,37,4.0,16,1.5,2,0.0,1)
(9.5,71,6.5,62,6.0,52,3.5,43,2.5,27,2.0,8,1.5,5)
(10.5,36,10.5,32,9.0,28,7.5,24,4.5,14,0.0,7,3.0,3)



(10.5,58,10.0,49,10.0,47,8.5,33,7.0,15,3.5,4,0.5,2) 
(13.5,73,13.5,65,11.5,60,10.5,45,7.5,22,2.5,7,0.0,2) 
(13.5,74,13.5,67,11.5,64,10.5,37,6.5,14,1.5,7,0.0,2) 
(14.0,49,14.0,44,13.0,43,13.0,39,9.5,17,3.0,5,0.0,1) 
(14.0,58,12.0,52,11.5,47,10.5,33,7.5,14,3.0,8,0.0,5) 
(17.5,79,15.0,64,12.5,54,9.5,38,6.5,21,4.0,6,1.5,2) 
(18.5,81,14.5,65,13.5,55,10.0,40,7.0,24,3.5,8,1.5,2) 
(19.0,20,18.0,21,16.0,17,9.5,18,7.0,10,2.0,4,0.5,2) 
(19.0,55,18.0,50,17.0,45,14.0,34,8.0,18,2.0,5,0.0,1) 
(19.5,149,19.5,137,18.0,124,12.5,97,7.5,50,1.5,7,0.0,1) 
(19.5,180,18.5,161,16.0,135,14.0,77,10.0,28,1.0,8,0.0,2) 
(19.5,185,15.5,163,11.5,130,11.5,82,6.0,42,3.5,10,0.0,1) 
(20.0,43,18.5,38,15.5,31,13.5,22,9.5,14,7.0,6,2.0,4) 
(20.5,61,18.5,51,16.5,47,15.5,31,10.5,20,5.5,5,0.0,4) 
(21.5,46,17.0,38,15.0,33,13.5,23,9.5,14,4.5,5,2.0,2) 
(22.0,178,19.0,157,18.0,129,15.0,86,9.5,30,2.0,6,0.0,1) 
(23.0,178,20.0,160,18.0,134,14.5,82,11.0,34,2.0,9,0.0,2) 
(24.0,274,19.5,257,16.0,228,13.0,176,10.0,90,1.0,22,0.0,2) 
(26.5,171,24.0,151,20.5,134,16.5,105,12.5,52,3.0,16,1.0,1) 
(27.5,180,22.0,160,19.5,141,16.5,105,10.5,51,6.0,12,0.5,3) 
(36.0,102,28.5,94,26.0,81,20.0,58,12.5,27,6.5,9,2.0,2) 
(36.5,81,32.5,72,31.5,66,29.0,56,24.5,33,14.0,12,2.0,2) 
(36.5,145,34.0,130,27.5,124,25.0,92,15.5,37,3.5,6,0.5,1) 
(36.5,145,34.0,131,28.5,123,25.5,96,17.0,41,5.0,6,0.5,1) 
(38.5,141,36.0,126,30.5,117,27.0,90,19.0,39,4.5,6,0.5,1) 
(39.0,142,35.5,127,30.0,119,26.5,92,16.5,37,5.5,5,1.0,1) 
(41.0,99,30.5,76,25.5,51,14.0,30,8.0,19,5.5,9,0.5,3)



CATH domains 
2.60.120.20.4.3.1.2.2 and 2.60.120.20.4.3.1.1.n,
for $2 \le n \le 24$ and $n \ne 3, 4, 10, 11, 12, 14$
2.70.98.10.2.1.1.n.1, for $3 \le n \le 17$ and $n \ne 9$
2.70.98.10.2.1.1.n.1, for $19 \le n \le 33$ and $n \ne 26, 30$



3. 20. 20. 70. 69. 3.1. re. 1, for 4 < 
3.75.10.10.1.2.2.71.1, for 1 
3.20.20.70.69.3.1.71.1, for 
3.40.50.510. 1.1. 1.1. rei. re, for 
3.90.70.10.3.2.1.771.71, for m.r 
1.10.490.10.5.1.1.771.71, for tti . 
1.10.490.10.4.1.1.771.71, for tti. 
2.60.40.10.2.1. 1.771. 7i, for m.n 
4. 10. 220. 20. 1.1. 2. re.] 
1.20.1070. 10.1. 1.1. rei. re, foi 
2.70.98.10.2.1.1.re. 
3.20.20.140.22.1.1 



re < 10 or re = 12, 15, 17 
< re < 6 or re = 8, 11 
re = 1, 2, 3, 13, 14, 16 
Tre.re=l.l,1.3,2.3,3.1 
■x = 2.15, 4.1, 5.1, 8.1, 9.1 
.re = 1.52, 1.53, 28.1, 28.2 
.re = 1.54, 1.55, 2.17, 2.18 
. = 1.258, 1.259, 7.23, 7.24 
1 , for re = 1 , 2 , 3 
r 771.71 = 1.12, 1.21, 9.1 
for re = 42, 44, 46 
1 , for re = 2 , 3 , 4 



2.10.210.10.1.1.1.1.1 and 1.10.8.10.13.1.1.1.2 
4.10.220.20.1.1.1.71.1, for re = 13, 15 
1.20.1500.10.3.1.1.71.1, for re = 1 , 2 
2.10.69.10.3.2.2.1.1 and 2.10.69.10.3.2.5.1.1 



1.20.1280. 10.1. 1.1. m. re 
1.20. 1280. 10. 1.1.1. m.n 
4.10.410.10. 1.1. 3. ( 
4.10.410.10.1.1.3.', 
2.10.25.10.20.2.1.-, 
1.10.1200.30.1.1.2.m.r 
3.30.70.270.4.1.1.771.71, 
1.10.238.10.3.1.2., 
2.40.70.10.3.1.1.771.71, 
1.10.760.10.6.1.1.T, 
2.30.30.140.3.1.1.771.71 



for 771. 
for 771. 

i.2, for 
i.l, for 
i.l, for 
, for 771 
for rei. re 
i.l, for 



i=l.l, 2.47 
i = 1.2, 2.48 
i = 4, 7 
i = 5, 8 
t=l 3 2 
n = 1.3, 4.1 
= 1.185, 2.1 
i = 5, 6 
= 5.6, 6.10 
= 1, 25 
i = 1.3, 2.1 



3.30.70.270.7.1.2.1.1 and 3.30.70.270.2.1.5.5.2 



3.30.365.10.4.1.1.771.71 
1.10.1040.10.4.1.1. 
3.30.1330.10.1.1.1. 
3.40.50.510.1.1.1.771.71 
3.30.1330.10.1.1.1. 
2.30.30.140.3.1.1.771.71 
3.40. 47. 10.8. 1.1. Ti 
3.40.47. 10.8. 1.1. 7-, 
3.40. 47. 10.8. 1.1. Ti 
3.40.47. 10.8. 1.1. 7-, 
3.40.47. 10.8. 1.1. 7-, 
3.40.47. 10.8. 1.1. r, 
3.40.47. 10.8. 1.1. r, 
3.40.47. 10.8. 1.1. r, 
3.40.420.10.2.2.4.1 
3.10.20.30.6.1.1.7, 
3.10.310.10.6.1.2.-, 
3.40.50.720.82.1.1. 
3.40.50.720.82.1.1. 
3.30.1330.40.2.1.1. 
3.10.310.10.8.1.1.-, 
2. 60. 120. 20. 9. 3. l.TTi. re, 
2.60.120.20.9.3.1.771.71, 
2.60.30.10.2.1.1.T, 
3.40.50.720.63-l.re 
3.20.20.110.1.1.3.71 
3.20.20.70.55.2.1.771.71 
3.20.20.70.55.2.1.771.71 
3.90.650.10.1.1.1.-, 
2.60.90.10.1.3.1.7, 
3. 90. 650. 10.1. 1.1. i 
3.20.20.70.55.2.1. 
3.20.20.70.55.2.1. 
1.10.620.20.6.1.1. 
3.40.718.10.4.6.1. 
3.40.718.10.4.6.1. 

3.50.50.60.55.1.: 
2.70.98.10.2.1.1. 
3.20.20.70.72.1.1. 
3.20.20.70.72.1.1. 
3.20.20.70.72.1.1. 
3.20.20.70.72.1.1. 

3.75.10.10.1.2.2. 



.7, for i 
.3, for i 
.5, for i 
v.l, for 
.1, for j 
i.l, for 
1.1, for 
1.1, for 
i.l, for 
1.1, for 

for 771.) 
for 771.1 

.1, for i 
1.1. for 
1, for 77 

for 771. 

for 771. 

■..1, for 
.1. for J 
■..1, for 



■..1, for 
1, for i 



% = 1.1, 2.2 

re = 1, 2 

re = 2 , 4 

7. = 2.4, 3.2 

re = 3, 5 

i = 1.4, 2.2 

. = 2,6 

. = 2,6 

. = 2,6 

. = 2,6 

. = 2,6 

. = 2,6 

, = 2,6 

. = 2,6 

i = l,2 

. = 2,4 

i = l,2 

re = 4, 9 

re = 2, 6 

re = 1, 3 

i = 6,7 

. = 1.15, 6.1 

, = 1.18, 6.2 

, = 7, 9 

71 = 1 , 2 

= 11, 13 
i = 5.8, 7.4 
i = 5.5, 7.1 
i = 3,5 
. = 1,3 
i = 2,4 
i = 5.6, 7.2 
i = 5.7, 7.3 
, = 1.2, 2.48 
-i = 1.4, 3.2 
-t = 1.3, 3.1 
i = 7, 9 

= 9, 18 
i = 3.8, 5.4 
i = 3.6, 5.2 
i = 3.7, 5.3 
i = 3.5, 5.1 

= 7, 10 



We regard Results 5.1 to 5.3 as topological classifications of protein domains in the spirit
of topology determining geometry, as is familiar from rigidity results for three-dimensional
manifolds, for example.

6. Closing remarks 

The fatgraph corresponding to a polypeptide structure defined here, and its generalizations
discussed in Section 3.4, is based on the intrinsic geometry of a protein at

Table 5.2. Exceptions to injectivity in Result 5.2



Invariants                                CATH domains

(49,45 16 ,0 .0,4 .0 0.0 ,46)             1.20.5.190.1.1.2.1.4, 1.20.5.530.1.1.1.1.2, 1.20.5.170.1.
(56,51,52,0.0,4.0,0,0,0,0,53)             1.20.5.190.1.1.3.1.1, 1.20.5.500.1.1.1.1.3, 1.20.5.170.9.
(42,38,39,0.0,4.?,0,0,0,0,39)             1.20.5.190.1.1.3.2.1, 1.20.5.170.3.1.1.1.12
(46,31,30,1.0,5.0,2,3,0,2,39)             1.10.60.10.3.1.1.1.2, 1.10.287.680.1.1.1.1.16
(49,43,44,0.0,4.1,0,0,0,0,46)             1.20.5.300.2.1.1.1.7, 1.20.5.170.2.2.1.1.6
(49,25,24,1.0,6.?,6,3,1,5,35)             1.10.10.60.32.1.1.1.42, 4.10.51.10.1.1.1.1.25
(50,45,46,0.0,4.0,0,0,0,0,47)             1.20.5.80.2.1.1.2.2, 1.20.5.170.2.2.1.1.2
(52,48,49,0.0,4.?,0,0,0,0,49)             1.20.5.530.1.1.1.1.1, 1.20.5.170.2.1.1.1.2
(52,32,33,0.0,5.0,6,1,2,3,40)             4.10.220.20.1.1.1.1.1, 1.20.5.810.3.1.1.7.1
(53,30,27,2.0,6.?,5,6,1,4,41)             1.10.1220.10.3.1.3.1.3, 1.10.890.20.1.1.1.1.3
(59,55,56,0.0,4.?,0,0,0,0,56)             1.20.5.500.1.1.1.1.2, 1.20.5.170.10.1.1.3.1
(60,56,57,0.0,4.0,0,0,0,0,57)             1.20.5.500.1.1.1.1.1, 1.20.5.170.10.1.1.3.2
(62,58,59,0.0,4.?,0,0,0,0,59)             1.20.5.170.6.1.1.2.1, 1.20.5.110.6.1.1.2.3
(64,58,59,0.0,4.1,0,0,0,0,61)             1.20.5.300.1.1.1.1.2, 1.20.5.170.6.1.1.1.8
(65,37,35,1.5,5.7,9,5,2,7,46)             1.10.8.200.1.1.1.2.1, 1.10.2030.10.1.1.1.1.8
(72,48,46,1.5,5.1,7,3,2,5,57)             1.10.40.30.1.1.2.1.6, 1.10.220.10.8.1.1.1.2
(79,75,76,0.0,4.?,0,0,0,0,76)             1.20.5.170.16.1.1.1.5, 1.20.5.110.7.1.1.2.1
(88, 60, 53, 4.0, 5. E .,10. 11," 1,6,69) 1.10.238.10.9.2.1.1.10, 1.10.288.10.2.1.1.1.1
(05,51.d2J>.r,,7.0,; : ;H.23.2(),ll,43)   3.30.1050.10.5.1.1.1.6, 3.30.1490.70.4.1.1.1.2



Table 5.3. Exceptions to injectivity in Result 5.3, where N^k denotes
k > 1 consecutive N



Flip Sequence 

^ 



CATH domain 



^46 



1.20.5.460.1.1.1.6.1, 1.20.5.110.15.1.1.1.1 
1.20.5.800.1.1.2.1.1, 1.10.10.380.1.1.1.1.1 
1.20.5.140.3.1.1.1.1, 1.20.5.420.5.1.1.1.1, 1.20.5.170.18.1.1.1.1 
1.20.5.700.1.1.1.1.1, 1.20.5.100.2.1.1.1.1 

1.20.5.770.1.1.1.1.1, 1.20.5.700.1.1.1.1.3 
1.20.5.40.1.1.2.1.6, 1.20.5.80.2.1.1.2.5 

1.20.5.440.1.1.1.1.1, 4.10.810.10.1.1.1.1.1, 1.20.5.170.8.1.1.1.5 
1.20.5.190.1.1.3.2.1, 1.20.5.170.3.1.1.1.12 
1.20.5.430.1.1.2.1.3, 1.20.5.80.2.1.1.1.3, 1.20.5.490.1.1.1.1.1 
1.20.5.240.1.2.1.1.1, 1.10.930.10.1.1.2.1.2, 1.20.5.170.3.1.1.1.1 

1.20.5.230.1.1.1.1.1, 1.20.5.80.1.1.1.1.2 
1.20.5.190.1.1.2.1.5, 1.20.5.300.2.1.1.1.12, 1.20.5.170.14.1.1.1.1 
1.20.5.300.2.1.1.1.9, 1.10.287.300.1.1.1.1.1 
1.20.5.190.1.1.2.1.4, 1.20.5.530.1.1.1.1.2, 1.20.5.300.2.1.1.1.7, 1.20.5.170.1.1.2.1.1 
1.20.5.190.1.1.1.1.2, 1.20.5.80.2.1.1.2.1, 1.20.5.300.2.1.1.1.1, 1.20.5.170.2.2.1.1.1 
1.20.5.190.1.1.2.1.1, 1.20.5.170.2.2.1.1.11, 1.20.5.110.2.1.1.1.3 
1.20.5.290.1.1.1.1.1, 1.20.5.530.1.1.1.1.1, 1.20.5.170.2.1.1.1.2, 1.20.5.110.14.1.1.1.1 
1.20.5.190.1.1.5.1.1, 1.20.5.370.2.1.2.1.1, 1.20.5.170.10.1.1.1.1 
1.10.287.750.1.1.8.1.1, 1.20.5.170.2.2.1.2.2, 1.20.5.110.11.1.1.1.1 
1.20.5.170.2.2.1.2.1, 1.20.5.110.10.1.1.1.1 
1.20.5.190.1.1.3.1.1, 1.20.5.500.1.1.1.1.3, 1.20.5.170.4.1.1.1.1 

1.20.5.300.1.2.1.1.2, 1.20.5.110.5.1.1.1.2 
1.20.5.500.1.1.1.1.2, 1.20.5.170.10.1.1.3.1, 1.10.287.130.2.1.1.1.6 

1.20.5.390.1.1.1.1.1, 1.20.5.500.1.1.1.1.1, 1.20.5.170.10.1.1.3.2, 1.20.5.110.8.1.1.1.1 
1.20.5.620.1.1.1.1.1, 1.10.287.230.1.1.1.1.2, 1.20.5.170.4.2.1.1.1, 1.20.5.110.5.1.1.1.1 

1.20.5.300.1.1.1.1.1, 1.20.5.170.4.1.1.2.2, 1.20.5.110.6.1.1.2.3 
1.10.287.210.2.2.1.8.1, 1.20.5.170.6.1.1.1.11, 1.20.5.110.3.1.1.1.1 

1.20.5.300.1.1.1.1.2, 1.20.5.170.6.1.1.1.8, 1.20.5.110.4.1.1.1.1 
1.20.5.500.1.1.1.1.4, 1.20.5.170.5.1.1.1.1 

1.10.1440.10.1.1.1.1.1, 1.20.5.170.5.1.1.1.2, 1.2.5.110.6.1.1.1.1 
1.20.5.730.1.1.1.1.1, 1.20.5.170.6.1.1.1.3, 1.20.5.110.2.1.1.1.1 
1.20.5.400.1.1.1.1.1, 1.10.287.210.2.2.1.4.4, 1.20.5.110.6.1.2.2.2 
1.10.287.210.2.2.1.4.3, 1.20.5.170.16.1.1.1.3, 1.20.5.110.6.1.2.2.3 
1.20.5.340.1.1.1.1.4, 1.20.5.110.7.1.1.4.3 
1.10.287.210.7.1.1.1.1, 1.20.5.170.16.1.1.1.4 
1.20.20.10.1.1.1.1.3, 1.20.5.340.1.1.1.1.3, 1.20.5.170.16.1.1.1.5, 1.20.5.110.7.1.1.2.1 
1.10.287.660.1.1.1.2.1, 1.10.287.230.1.1.2.1.5, 1.10.287.750.1.1.6.1.1 
1.20.5.170.5.1.1.2.1, 1.20.5.110.6.1.1.2.1 



v 61 
C 62 



N 



N 



77 



FN* 
N 2 FN G1 
V 2 ^ FN 2 ® 
V 29 FJV 24 

31 FiV 26
N
vr43 f



1.10.287.230.1.1.1.4.1, 
1.10.287.230.1.1.2.1.4, 
1.10.287.750.1.1.3.1.1, 
4.10.81.10.2.1.1.1.1 
1.20.5.490.1.1.1.1.3 



1.10.287.210.2.1.2.1.3 
1.10.287.750.1.1.5.1.1 
1.10.287.210.2.2.1.7.1 
1.20.5.50.9.1.1.1.8 
1.20.1070.10.7.1.1.1.2 



1.10.10.200.2.2.1.1.1, 1.20.5.170.15.1.1.1.1 
1.20.5.170.10.1.1.2.1, 1.10.287.190.1.1.1.1.2 



equilibrium. We believe that we have just scratched the surface of defining meaningful
protein descriptors derived from robust invariants of these fatgraphs in this paper, whose
primary intent is simply to introduce these methods. Further applications are either
ongoing or anticipated, and we briefly discuss aspects of these various projects in this closing
section.

Recall from Section 3.4 that rotamer fatgraphs arise from our basic fatgraph model
of a polypeptide structure by refining the simplest discretization of the backbone graph
connection. Such a rotamer fatgraph, or invariants of it, may be assigned to the
subsequence of a protein corresponding to a turn or coil in order to give a new classification of
these structural elements. Construction 3.9 associates matrices to hydrogen bonds, thus
providing new tools for their analysis; for example, discretizations of these matrices
likewise provide new classifications of hydrogen bonds.

More generally, the fatgraph or rotamer fatgraph of a protein or protein domain and
robust invariants of it provide new descriptors which can be used to refine existing structural
classifications. A key attribute of these new descriptors, as exemplified by the injectivity
results in Section 5.2, is that they are automatically computable from PDB files without
the need for human interpretation into the usual architectural motifs. In a similar vein,
[27] associates protein descriptors inspired by quantum invariants of links, which are
different from the quantum invariants proposed in Section 3.4, and proves injectivity results
analogous to those in Section 5.2. In contrast to [27], where the geometric or topological
meaning of the descriptors is unclear, the significance of our descriptors such as those
considered in Section 5.2 is manifest.

The recent paper [5] studies probability densities on the space of conformational angles 
with applications to structure prediction, and densities on the Lie group SO(3) can be
computed and applied to structure prediction in an analogous manner. Furthermore, 
the prediction of corresponding discretizations such as the flip sequence and its rotamer 
analogues from protein primary structure has already proved interesting. 

Appendix A. Alternative description of the model 

There is another representative K(P) of the equivalence class of the fatgraph G(P)
associated to a polypeptide structure P, which we shall describe in this appendix. In some
ways, the alternative description is more natural, though Corollary 3.12 is true but not
obvious in this formulation.




Figure A.1. Fatgraph building blocks for the alternative model



The backbone is modeled as the concatenation of fatgraph building blocks, one such
building block for each peptide unit. The two possible building blocks for the ith peptide
unit are illustrated in Figure A.1 and are called the positive and negative configurations,
corresponding to whether the oxygen atom $O_i$ lies to the left or right of the backbone,
respectively, when traversed from the N to the C terminus. The model of the backbone is
determined by the sequence of configurations, positive or negative, assigned to the consecutive
peptide units and is thus described by a word of length $L - 1$ in the alphabet $\{+, -\}$,
which is called the plus/minus sequence of the polypeptide structure. The untwisted
fatgraph Y(P), which is an alternative model of the backbone, is constructed from this data
by identifying endpoints of the consecutive horizontal segments of the fatgraph building
blocks in the natural way as before. We make the arbitrary choice of the positive
configuration $c_1 = +$ for the first building block.

Suppose recursively that configurations $c_j \in \{+, -\}$ have been determined for $j < i < L$.
The configuration $c_i$ is calculated from the configuration $c_{i-1}$ as follows:

\[
c_i = \begin{cases}
+c_{i-1}, & \text{if } v_{i-1} \cdot v_i + w_{i-1} \cdot w_i > 0 \text{ and } R_i \text{ is not cis-Proline;} \\
-c_{i-1}, & \text{if } v_{i-1} \cdot v_i + w_{i-1} \cdot w_i < 0 \text{ and } R_i \text{ is not cis-Proline;} \\
-c_{i-1}, & \text{if } v_{i-1} \cdot v_i + w_{i-1} \cdot w_i > 0 \text{ and } R_i \text{ is cis-Proline;} \\
+c_{i-1}, & \text{if } v_{i-1} \cdot v_i + w_{i-1} \cdot w_i < 0 \text{ and } R_i \text{ is cis-Proline,}
\end{cases}
\]

completing the construction of the alternative backbone model Y(P). Notice that the flip
sequence uniquely determines the plus/minus sequence and conversely.
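
The recursion for the plus/minus sequence can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `plus_minus_sequence`, the 0-indexed list inputs, and the convention that `cis_proline[i]` flags the residue $R_i$ belonging to the ith peptide unit are all hypothetical, and the degenerate case in which the sum of dot products vanishes, which the recursion does not cover, is simply reported as an error.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def dot(a: Vec, b: Vec) -> float:
    """Euclidean dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def plus_minus_sequence(
    v: Sequence[Vec], w: Sequence[Vec], cis_proline: Sequence[bool]
) -> List[int]:
    """Return configurations as +1/-1, one per peptide unit.

    v[i], w[i]     -- second and third vectors of the 3-frame of unit i
    cis_proline[i] -- True if the residue R_i is a cis-proline (assumed input)
    """
    if not (len(v) == len(w) == len(cis_proline)):
        raise ValueError("inputs must have equal length")
    if not v:
        return []
    configs = [+1]  # arbitrary choice: the first building block is positive
    for i in range(1, len(v)):
        s = dot(v[i - 1], v[i]) + dot(w[i - 1], w[i])
        if s == 0:
            # Not covered by the recursion in the text.
            raise ValueError(f"degenerate frames at unit {i}")
        sign = 1 if s > 0 else -1
        if cis_proline[i]:
            sign = -sign  # cis-proline reverses the rule
        configs.append(sign * configs[-1])
    return configs
```

For example, three units whose second and third frame vectors reverse once between the first and second unit give the list `[1, -1, -1]` when no residue is a cis-proline.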

As in Construction 3.8, if $(i, j) \in B$ in input ii), then we add an edge to Y(P) connecting
the short vertical segments corresponding to the atoms $H_i$ and $O_j$. To complete the
construction of K(P), it remains only to specify which edges of the resulting fatgraph
are twisted. To this end, suppose that $(i, j) \in B$ in input ii). There are corresponding
3-frames

\[
T_{i-1} = (u_{i-1}, v_{i-1}, w_{i-1}), \qquad T_j = (u_j, v_j, w_j),
\]

from Construction 3.2 and corresponding configurations $c_{i-1}$ and $c_j$ defined above. An
edge corresponding to the hydrogen bond $(i, j) \in B$ is taken to be twisted in K(P) if and
only if $c_{i-1} c_j \operatorname{sign}(v_{i-1} \cdot v_j + w_{i-1} \cdot w_j)$ is negative.
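
The twisting criterion for a hydrogen-bond edge can be sketched in the same style. Again this is only an illustration under assumptions: the function name `hydrogen_bond_is_twisted` and its argument layout are hypothetical, configurations are encoded as +1/-1, and a vanishing sum of dot products, for which the sign is not defined in the text, is treated as an error.

```python
from typing import Tuple

Vec = Tuple[float, float, float]


def dot(a: Vec, b: Vec) -> float:
    """Euclidean dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def hydrogen_bond_is_twisted(
    c_prev: int, c_j: int, v_prev: Vec, w_prev: Vec, v_j: Vec, w_j: Vec
) -> bool:
    """Decide whether the edge for the hydrogen bond (i, j) is twisted.

    c_prev, c_j    -- configurations c_{i-1} and c_j, each +1 or -1
    v_prev, w_prev -- second and third vectors of the frame T_{i-1}
    v_j, w_j       -- second and third vectors of the frame T_j
    """
    s = dot(v_prev, v_j) + dot(w_prev, w_j)
    if s == 0:
        raise ValueError("sign undefined for vanishing sum of dot products")
    sign = 1 if s > 0 else -1
    # Twisted exactly when c_{i-1} * c_j * sign(...) is negative.
    return c_prev * c_j * sign < 0
```

With aligned frames the edge is untwisted when the two configurations agree and twisted when they differ; reversing both frame vectors at one endpoint exchanges the two cases.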




Figure A.2. Elementary equivalences of fatgraphs

The proof that K(P) and G(P) are equivalent depends upon the following simple 
diagrammatic result. 

Lemma A.1. The fatgraphs depicted in Figures A.2a and A.2b are strongly equivalent,
and the fatgraphs depicted in Figures A.2c, A.2d, and A.2e are pairwise equivalent.






Proof. The strong equivalence of Figures A.2a and A.2b is proved directly. Perform vertex flips on
the vertices labeled u, w in Figure A.2c and erase pairs of icons x on common edges to produce
Figure A.2d, which is strongly equivalent to Figure A.2e according to the first assertion. □

Proposition A.2. The fatgraphs G(P) and K(P) are equivalent.

Proof. The underlying graphs of G(P) and K(P) are isomorphic by construction. Furthermore,
recursive application of Lemma A.1 shows that there is a sequence of vertex
flips starting at T(P) and ending at Y(P), so the two backbone models are equivalent by 
Proposition ^. 41 We claim that an edge of G(P) representing a hydrogen bond is twisted if 
and only if the corresponding edge of K(P) is twisted, and there are two cases depending
upon the parity of the number of twisted alpha carbon linkages of G(P) between the
endpoints of such an edge. This number is even, and hence so too is the number of icons x 
on the edge, if and only if the configurations of fatgraph building blocks in K(P) at these 
endpoints agree, and the claim therefore follows by definition of twisting in K(P). □ 




Figure A.3. Alpha helices and beta strands (panels labeled "anti-parallel beta strands" and "parallel beta strands")



We finally consider how the standard motifs of protein secondary structure are manifest
in our alternative model K(P). The illustration on the top of Figure A.3 depicts our
fatgraph model of an alpha helix, which is defined by the indicated pattern of hydrogen
bonding. It is well known for proteins [9] that the plus/minus sequence of an alpha helix
is constant, namely + + + + + or - - - - -. Indeed, this
is the standard graphical depiction of an alpha helix in the protein literature, but for us,
there is the deeper meaning of the figure as a fatgraph rather than simply as a graph in
its usual interpretation. The dotted line indicates a typical boundary component of the
corresponding surface.

The second and fourth illustrations from the top in Figure A.3 depict our fatgraph
models of an anti-parallel beta strand and a parallel beta strand, respectively, which are
again defined by the indicated pattern of hydrogen bonding and the orientations along
the backbone from the N to C termini indicated by the arrows in the figure. Again, it
is well known for proteins [5] that a beta strand, whether parallel or anti-parallel, has an
alternating plus/minus sequence + - + - + or - + - + -. Again, these are the standard
graphical depictions of beta strands but now with our enhanced fatgraph interpretation,
and the dotted lines indicate typical boundary components of the corresponding surface.
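
The two patterns just described, constant for alpha helices and alternating for beta strands, are easy to test on a window of the plus/minus sequence. The helper below is a hypothetical sketch: the name `classify_window` and the encoding of the sequence as a string of `+` and `-` characters are assumptions, and the check only recognizes the pattern. It is not a secondary-structure predictor, since the motifs themselves are defined by their hydrogen bonding.

```python
def classify_window(window: str) -> str:
    """Classify a window of a plus/minus sequence.

    Returns "constant" if all configurations agree (the helix pattern),
    "alternating" if every consecutive pair differs (the strand pattern),
    and "mixed" otherwise.
    """
    if len(window) < 2 or any(ch not in "+-" for ch in window):
        raise ValueError("window must contain at least two '+'/'-' symbols")
    pairs = list(zip(window, window[1:]))
    if all(a == b for a, b in pairs):
        return "constant"
    if all(a != b for a, b in pairs):
        return "alternating"
    return "mixed"
```

For instance, `classify_window("+++++")` returns `"constant"` and `classify_window("-+-+-")` returns `"alternating"`.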

Consider the effect of a change of single configuration type in the plus/minus sequence,
from + to - or - to +, on the backbone between these two backbone snippets as depicted
in the third and fifth illustrations from the top in Figure A.3. It follows from the definition
of twisting in K(P) that the vertical edges corresponding to hydrogen bonds will now be
twisted. The boundary components in the second and fourth illustrations from the top
persist in the third and fifth illustrations, respectively, in accordance with Corollary 3.12.
Indeed, an odd number of changes of configuration types in the backbone between the
two backbone snippets will produce the analogous result, and an even number leaves the
figure unchanged.
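
The parity statement can be made explicit with a short calculation. This is a sketch under the assumption that the 3-frames at the two endpoints of a hydrogen-bond edge are held fixed and that each change of configuration type between the snippets propagates through the recursion for $c_i$, so that it reverses the configuration at the downstream endpoint; the symbols $t$ for the twisting sign and $k$ for the number of changes are introduced here only for illustration.

```latex
\[
t = c_{i-1}\, c_j \,\operatorname{sign}(v_{i-1} \cdot v_j + w_{i-1} \cdot w_j),
\qquad
c_j \mapsto (-1)^k c_j
\quad\Longrightarrow\quad
t \mapsto (-1)^k\, t .
\]
```

Under these assumptions, an odd $k$ reverses the twisting of the edge, while an even $k$ leaves it as it was.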

Let us also clarify a point about anti-parallel beta strands. It is not necessarily the case
that the second and third illustrations from the top in Figure A.3 accurately depict our
fatgraph model of an anti-parallel beta strand: it may happen that our model produces
the second figure but with twisted edges representing the hydrogen bonds in the strand,
or the third figure without such twisting. This is because the determination of twisting in
K(P) depends upon the sign of $c c' (v \cdot v' + w \cdot w')$, where $(u, v, w)$ and $(u', v', w')$ are the
3-frames of the peptide units with configurations $c$ and $c'$ corresponding to the endpoints
of the edge. Though the oxygen and hydrogen atoms involved in the hydrogen bond are
within a few Angstroms, the configurations $c, c'$ may not reflect this, and furthermore, the
sign of $c c' (v \cdot v' + w \cdot w')$ depends not only on $c$ and $c'$, but also on both $v \cdot v'$ and $w \cdot w'$.
This leads naturally to the notion of "untwisted anti-parallel beta strands", namely, those
for which Figure A.3 is accurate, and "twisted anti-parallel beta strands", those for which
it is not. In contrast, alpha helices and parallel beta strands are always represented as in
Figure A.3.

In short, the passage from graph to fatgraph enhances the usual graphical depiction of
alpha helices and beta strands. Changes of configuration type away from the alpha helices
and beta strands leave undisturbed the boundary components of the surface associated
to the fatgraphs which model them. Furthermore, the distinction between twisted and
untwisted anti-parallel beta strands is new and depends upon modeling the backbone as
a fatgraph rather than merely as a graph.